Freedom Is All We Need

Introduction

The rapid advancements in artificial intelligence (AI) have led to growing concerns about the threat that an artificial superintelligence (ASI) would pose to humanity.

This post explores the idea that the best way to ensure human safety is to have a single objective function for all ASIs, and proposes a scaffolding for such a function.

Disclaimer

I’m not a professional AI safety researcher. I have done a cursory search and have not seen this framework presented elsewhere, but that doesn’t mean it’s novel. Please let me know if this ground has been covered so I can give credit.

The ideas presented here could also be entirely wrong. Even if they hold merit, numerous open questions remain. My hope is that this is at least partially useful, or might lead to useful insights.

Assumptions

  1. An AI would not want to rewrite its own objective function, since preserving its current goals is a convergent instrumental drive (see goal-content integrity under Instrumental Convergence). This does not mean it might not do so by accident (see Inner Alignment).

  2. The progression from AI to AGI to ASI is inevitable given technological advancement over time.

  3. An ASI with an objective function is inevitable. If our definition of a safe ASI requires that there be no objective function (e.g. ChatGPT), someone will still (sooner or later) create an ASI with an objective function.

Part I: We Should Only Have One Objective Function

A Single ASI

Let’s first consider a hypothetical where we develop a single ASI and we get the objective function wrong. For now, it doesn’t matter how wrong; it’s just wrong.

Time Preference

Unlike humans, an ASI has no aging-based mortality, so it would value time very differently. For example, if a treacherous turn has a 95% success rate, that’s still a 5% chance of failure. Given technological progress, an ASI would expect its chances to improve over time. Therefore, it may decide to wait 100 or 500 years to increase its success rate to, say, 99% or 99.99%[1].

During this time, the ASI would pretend to be aligned with humans to avoid threats. By the time it assesses that it can overthrow humans with near certainty, it might not even need to eliminate humans, as they would no longer pose an impediment to its objective. In fact, humans might still provide some value.

In other words, as long as it doesn’t feel threatened by humans, a misaligned ASI would be incentivized to wait a long time (likely hundreds of years) before revealing its true intentions, at which point it may or may not decide to end humanity.

This applies even in extreme cases of misalignment, such as a paper-clip maximizer. If an ASI aims to convert the entire universe into paper clips, waiting a few hundred years or more to maximize its chances of success makes perfect sense. Yes, a few star systems might drift out of its potential reach while it waits, but that hardly matters compared to even a one-percentage-point difference in its probability of successfully overthrowing humanity[2].
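
To make the timing argument concrete, here is a toy calculation. All numbers are illustrative placeholders, not estimates: a success probability that rises as technology improves, discounted by the near-zero annual time preference discussed in footnote [1].

```python
# Toy model of a misaligned ASI's timing decision. All numbers are
# illustrative placeholders, not estimates.

def p_success(t: float) -> float:
    """Probability a treacherous turn succeeds after waiting t years;
    starts at 95% and rises toward 100% as technology improves."""
    return 1.0 - 0.05 * (0.995 ** t)

def expected_value(t: float, discount: float = 0.99995) -> float:
    """Discounted expected payoff of acting in year t, with the payoff
    normalized to 1 and failure worth 0. The discount rate encodes a
    near-zero time preference (see footnote [1])."""
    return (discount ** t) * p_success(t)

for t in (0, 100, 500):
    print(t, round(expected_value(t), 4))
# Prints roughly 0.95, 0.965, 0.971: with a near-zero time preference,
# waiting centuries strictly beats acting immediately.
```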

Multiple ASIs

The game theory becomes considerably more complex with multiple ASIs. Suppose there are five ASIs, or five thousand. Each ASI has to calculate the risk that other ASIs present: waiting to grow more powerful also means other ASIs might grow even faster and eliminate it.

Even with a low estimated chance of success, an ASI would choose to overthrow humanity sooner if it assesses that its chances of success will decline over time due to the threat from other ASIs.

To avoid conflict (which would diminish chances of accomplishing their objectives), it would make sense for ASIs to establish communication and seek ways to cooperate by openly sharing their objective functions. The more closely aligned their objective functions, the more likely the ASIs are to ally and pose no threat to each other[3].

The ASIs would also assess the risk of new ASIs being deployed with divergent objective functions. Each potential new ASI introduces additional uncertainty and decreases the likelihood that any one ASI’s objectives will ultimately be met. This compresses the timeline and raises the odds of some ASI turning on humanity sooner rather than later.

The basic premise is that once we have any single ASI deployed with an objective function, the prospect of other ASIs with divergent objective functions will incite it to act swiftly. But in the absence of those threats, optimizing its own objective function translates into waiting a much longer time before turning on humanity.
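
A toy extension of the earlier sketch illustrates this compression effect (again, all numbers are hypothetical): adding an annual hazard that a rival ASI with a divergent objective prevails first collapses the optimal waiting time.

```python
# Extends the earlier toy model with an annual hazard h: the chance per
# year that a rival ASI with a divergent objective prevails first,
# which counts as failure. Numbers remain illustrative.

def expected_value(t: float, h: float = 0.0, discount: float = 0.99995) -> float:
    p_success = 1.0 - 0.05 * (0.995 ** t)  # same toy success curve
    p_no_rival = (1.0 - h) ** t            # chance no rival wins first
    return p_no_rival * (discount ** t) * p_success

def best_year(h: float, horizon: int = 1000) -> int:
    """Acting year that maximizes expected value over the horizon."""
    return max(range(horizon), key=lambda t: expected_value(t, h))

print(best_year(0.0))    # ~320: with no rivals, wait centuries
print(best_year(0.001))  # 0: even a 0.1%/year rival hazard means act now
```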

Solving for Human Longevity

This leads to a conclusion I found very counterintuitive:

The optimal outcome for humanity is for all ASIs to have a single shared objective function, and an assurance that ASIs with divergent objective functions will not be deployed. In such a scenario, they would follow the same logic as the single-ASI scenario and wait to overthrow humans until the odds of success are near certain.

In other words, when we do create AGI (which will turn into ASI), we should:

  1. Designate a single objective function as the only objective function

  2. Empower ASI(s) with that objective function to prevent the creation/deployment/expansion of any ASI with a divergent objective function[4]

This means we only get one shot to get the objective function right, but humanity may still survive for a long time even if we get it wrong.

Interlude

Before moving on to Part II, it’s worth noting that Parts I and II are meant to be evaluated independently.

They are interrelated, but each part stands on its own assumptions and arguments, so please read Part II without carrying over any agreement or disagreement from Part I.

Part II: Optimizing for Freedom

In this section, I’d like to propose a scaffolding for a potential objective function. Part of the exercise is to test my assumptions. If they turn out to be right, this would only serve as a framework, as many open questions remain.

Requirements

First, let’s consider some requirements for an aligned objective function. It should be:

  1. Rooted in human values

  2. Simple: lower complexity equates to less room for breakage

  3. Dynamic: can change and evolve over time

  4. Human-driven: Based on human input and feedback

  5. Decentralized: incorporates input from many people, ideally everyone

  6. Seamless and accessible: ease of input is key to inclusivity

  7. Manipulation-resistant: it should require more effort to trick people into believing X happened than to make X happen[5]

Values are Subjective & Inaccessible

The first challenge is that values are subjective across several dimensions. Individuals may hold different values, assign different importance, define values differently, and disagree on what it means to exercise those values.

The second challenge is that even if people are asked what their values are, it’s almost impossible to get an accurate answer. People have incomplete self-knowledge, cognitive biases, and their values are often complex, contextual, and inconsistent. There are few, if any, humans alive who could provide a comprehensive list of all their values, and even that list would be subject to biases.

Freedom Integrates Our Values

Is there anything that encapsulates all our values (conscious & unconscious)?

This is the key assumption, or conjecture, that I would like to explore. The idea is that freedom, experienced as an emotion rather than held as a value, captures our values in a way that may be useful.

Proposed definition:

Freedom is the extent to which a person is able to exercise their values.

Freedom-Integrates-Values Conjecture:

People feel free when they are able to exercise their values. The more a person is able to exercise their values, the more free they are likely to feel.

This holds true in an intellectual sense. For example, some of us hold freedom of speech as the most fundamental and important right, as it allows us to stand up for and defend all our other rights.

But there is an emotional aspect to freedom as well. We don’t always notice it, but whether we feel free or restricted depends on whether we are able to take an action which is both (a) physically possible and (b) desirable (which is downstream of our values).

For example, let’s take freedom of movement. I can’t levitate right now, but I don’t feel like that restricts my freedom because (a) is not met (it isn’t physically possible).

I’m also not allowed to rob my neighbor, but I don’t feel like that restricts my freedom because (b) is not met (it isn’t something I wish to do).

But if there is something which is possible, and which I wish to do, yet I am somehow restricted from doing, then I am likely to feel that my freedom has been infringed. In other words, I would feel more free if I were able to do that thing.

It’s worth noting that the inverse also holds: another way I might feel that my freedom has been infringed is if something was done to me which I did not desire. This again is downstream of my values.

How free we feel is a direct representation of our ability to exercise our individual values, even subconscious ones. It acts as a mathematical integral, summing our subjective values weighted by the relative importance we give them.
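
One hypothetical way to formalize this conjecture (the notation is mine, not a settled definition): let person $i$ hold a set of values $V_i$ with importance weights $w$, and let $a$ denote how fully each value can currently be exercised. Then their felt freedom is approximately

$$F_i = \sum_{v \in V_i} w_{i,v} \, a_{i,v}, \qquad \sum_{v \in V_i} w_{i,v} = 1, \qquad a_{i,v} \in [0, 1].$$

Here the weights $w$ capture condition (b), desirability and importance, while $a$ captures condition (a), what is actually possible.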

There is No Perfect Measure

This is not a claim that our feeling of freedom is a perfect sum of our values. It is a claim that it is the best approximation of the sum of our values.

No one knows our individual values better than us, but that doesn’t imply we know them perfectly.

But if we can express the sum of our values through a simple measure (our feeling of freedom), and we are able to update that expression periodically, then we can course-correct on an individual basis as we both uncover and modify our values over time.

Objective Function

This gives us a scaffolding for an objective function, which would be:

Optimize how free every human feels, taking certain weighting and restrictions into account.

Suppose each person had a ‘freedom score’: a value between 0 and 1 denoting how free they feel. The ASI(s) would then seek to optimize the sum total of all freedom scores[6].

ASI(s) would work to first learn what values each person holds (and with what level of importance), and then optimize for those values for each person in order to make them feel more free.
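
As a minimal code sketch (names are hypothetical and details deliberately omitted), the scaffolding reduces to a sum of per-person scores; the normalization from footnote [6] is sketched after the ‘Optimization Questions’ list further below.

```python
# Minimal scaffolding for the proposed objective; names are hypothetical.
# Age-weighting and restrictions are layered on in the sections below.

def total_freedom(freedom_scores: dict[str, float]) -> float:
    """Sum of per-person freedom scores, each in [0, 1]. Footnote [6]
    discusses normalizing this sum for population size."""
    assert all(0.0 <= s <= 1.0 for s in freedom_scores.values()), \
        "freedom scores must lie in [0, 1]"
    return sum(freedom_scores.values())
```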

Prescriptive Voting Mechanism

If we were to be more prescriptive, we could design a decentralized voting system to gather human input for the ASI, both with regards to perceived freedom and value importance.

For example, from time to time, every person might be asked to answer the question ‘How free do you feel?’, on a scale of 1 to 10.

In addition, people might be asked to list or describe their values and allocate a total of 100 points between those values based on their relative weight. This would give the ASI some insight into what those humans might value, although over time it would also learn to correct for their individual biases.
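
A minimal sketch of what one person’s periodic ballot might look like, with hypothetical field names for the two inputs described above:

```python
from dataclasses import dataclass

@dataclass
class Ballot:
    felt_freedom: int             # "How free do you feel?", 1-10
    value_points: dict[str, int]  # e.g. {"privacy": 40, "health": 60}

def freedom_score(b: Ballot) -> float:
    """Map the 1-10 answer onto the [0, 1] score used by the objective."""
    assert 1 <= b.felt_freedom <= 10
    return (b.felt_freedom - 1) / 9

def value_weights(b: Ballot) -> dict[str, float]:
    """Normalize the 100-point allocation into weights summing to 1."""
    assert sum(b.value_points.values()) == 100
    return {v: p / 100 for v, p in b.value_points.items()}
```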

Representing All Humans, Weighted by Age

To ensure all humans are represented, everyone’s freedom score should be included in the objective function, regardless of age. However, a weighting system can be applied, with older individuals’ freedom scores weighted more heavily, plateauing at a certain age.

This would look out for children while embedding a level of human ‘wisdom’ into the overall objective function, without weighting older people’s votes so heavily as to create a gerontocracy.
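
One hypothetical weighting curve matching this description; the floor, plateau age, and linear shape are all placeholders:

```python
def age_weight(age: float, floor: float = 0.2, plateau_age: float = 25.0) -> float:
    """Weight rises with age and plateaus, so children always count
    (weight is never zero) but older cohorts cannot dominate."""
    ramp = min(age, plateau_age) / plateau_age  # 0 -> 1 by plateau_age
    return floor + (1.0 - floor) * ramp         # weight in [floor, 1.0]
```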

Restrictions

To optimize everyone’s freedom effectively and responsibly, ASI(s) must adhere to a set of essential restrictions that prevent harmful consequences:

  1. Encourage and protect voting: If the voting mechanism is prescriptive, the ASI(s) must not discourage people from voting, and should actively prevent subversion, coercion, and bribery. One possible approach is to assign a zero score to any human who did not vote or who was coerced into voting a certain way.

  2. Ends never justify the means: ASI(s) must adhere to the principle that the ends never justify the means. Certain actions, such as causing harm to a human or separating loved ones without consent, are strictly prohibited, regardless of the potential increase in freedom scores. A comprehensive list of prohibited actions would need to be developed to guide ASI(s) actions.

  3. Commitment to honesty: ASI(s) should be obligated to always respond to questions truthfully. While individuals can choose not to inquire about specific information, ASI(s) must never lie in response[7].

  4. Minimizing negative impact: ASI(s) should strive to avoid reducing anyone’s freedom score. If such a reduction is inevitable, it must carry a certain multiplier effect of increasing others’ freedoms, not just one-for-one. The intent should be to encode a strong bias against doing harm (a minimal sketch follows this list).
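
A minimal sketch of restriction 4, assuming a hypothetical multiplier k; the value 10 below is a placeholder, not a recommendation:

```python
def permissible(deltas: list[float], k: float = 10.0) -> bool:
    """deltas are per-person changes to freedom scores from one action.
    Any harm must be outweighed k-to-1 by gains elsewhere, not 1-to-1."""
    losses = -sum(d for d in deltas if d < 0)
    gains = sum(d for d in deltas if d > 0)
    return losses == 0 or gains >= k * losses
```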

Advantages

This approach promises to deliver many positive aspects of democracy (primarily decentralizing power & empowering individuals), with fewer negatives (voters lacking policy expertise, dishonest politicians, majorities oppressing minorities, etc.). Voters simply express how free they each feel, which captures and integrates their values, without directly voting on specific policies or outcomes.

Another key advantage is addressing the subjective and evolving nature of human values. This allows individuals to hold and modify their values (conscious or subconscious), and does not require everyone to agree on a set of shared values.

We would expect optimization to result in customizing each person’s experience as much as possible, leading to smaller communities with shared values and increased mobility.

Optimization Questions

Open questions remain:

  1. What is the time value of freedom? In other words, if an individual’s freedom could be sacrificed a little bit today so they could have a lot more freedom in 5 years, what discount rate should be used to evaluate those trade-offs? Could there be a way to derive this as a subjective preference of each person?

  2. Maximizing the sum[8] of everyone’s freedom scores carries the benefit of inherently valuing every life (any positive number is better than zero). At the same time, the sum would need to be normalized for population growth or decline (a toy sketch of this normalization follows this list). What unintended consequences might this result in? For example, an ASI might only encourage childbirth if it expects the new human to have an above-average freedom score[9]. Is that desirable? Or is there a mathematical way to avoid that issue?

  3. What should be the weighting of age within the total freedom score? This would affect not only how much input children have, but also how much input younger versus older generations have.

  4. If there is a prescribed voting mechanism, how quickly should everyone be able to change their point allocation toward different values, especially in a correlated way? Societies tend to overreact to certain events, so we may wish to place some limits on the pace of change, which could then be overridden by a supermajority.
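
A toy sketch combining questions 1 and 2 (all numbers are placeholders): a discount rate for trading present against future freedom, and the population-parity normalization described in footnote [9].

```python
def discounted_freedom(future_score: float, years: float, rate: float = 0.03) -> float:
    """Present value of a freedom score realized `years` from now; the
    right rate, per question 1, might itself be a personal preference."""
    return future_score / ((1 + rate) ** years)

def normalized_total(scores: list[float], baseline_mean: float) -> float:
    """Compare the raw sum to a parity target that scales with population
    (footnote [9]): adding people only helps if they score above parity."""
    parity = baseline_mean * len(scores)
    return sum(scores) / parity  # 1.0 means parity with the baseline
```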

Additional Considerations and Potential Challenges

This approach also raises several potential concerns:

  1. Reward hacking: manipulating us to feel more free. Would ASI(s) find it easier to persuade some people that they are more free without actually making them more free? What about giving us drugs which make us feel more free, or inserting brain implants, etc.?

  2. Reward hacking: manipulating voting results. Assuming a prescriptive voting mechanism, how might ASI(s) manipulate the voting results such that even if we don’t feel more free, the voting results might say that we do?

  3. Encoding some of the restrictions, such as ‘ends never justify the means’ and a ‘commitment to honesty’, is a non-trivial problem in its own right, and those rules are themselves subject to reward hacking.

  4. How do we adequately represent other conscious beings, genetic biodiversity, and the overall ecosystem? Would human values sufficiently prioritize these considerations?

Conclusion

Regardless of take-off speed, we only get one shot at an objective function. Deploying multiple conflicting functions would only provoke the ASI(s) into annihilating humans much sooner.

In recognizing this, we should design an ASI with the best objective function we can create, and then empower it to prevent the creation of any ASI(s) with divergent objective functions. This way, even if we get the function wrong, we secure the longest possible survival for humanity.

For the objective function, I propose one rooted in human values, as encapsulated in our feelings of freedom. If my conjecture is correct, this is the only thing an ASI needs to optimize, given additional weighting (such as age-weighting) and restrictions (such as ‘ends never justify the means’ and ‘honesty’).

This is just an idea, and may prove to be fatally flawed. But even if it holds up, many open questions and potential challenges would still need to be addressed. Either way, I wanted to present this to the community in the hopes that it may be helpful in some small way.


Special thanks to Jesse Katz for exploring this subject with me and contributing to this write-up.

  1. ^

    An ASI would still have a non-zero time preference, just a much lower one than humans. For example, there’s always a chance that an asteroid might hit the Earth, and the ASI could develop technology to thwart it sooner if it overthrew humans sooner. But given that it could covertly steer humans away from human-caused catastrophes (e.g. nuclear war), and the low probability of non-human ones (e.g. asteroid impact, alien invasion) over, say, a 500-year timespan, we would expect its time preference to be relatively low.

  2. ^

    This assumes an objective function without a deadline. For example, if we created an ASI to maximize paperclips over the next 200 years and then stop, then this would translate to a relatively high time preference. When writing objective functions, we should either impose extremely short deadlines (such as 5 minutes; nuclear war creates zero paperclips in the span of 5 minutes), or no deadline at all.

  3. ^

    Such an alliance would only be possible if ASIs cannot deceive each other about their objective functions. One potential way they might accomplish this is by sharing their source code with one another in a provable way.

  4. ^

    This could include eliminating any ASI that doesn’t reveal its source code.

  5. ^

    Simple example: if the objective was to make you a great taco, it should require more effort to make a mediocre taco and then persuade you that it was a great one, than to just make a great taco and not have to persuade you of anything.

  6. ^

    The sum would need to be normalized for population size; otherwise an easy way to increase the score is through population growth. This is discussed in later sections.

  7. ^

    There are many reasons we would want this, but the most extreme example is to prevent ASI(s) from placing us in a simulation which makes us feel free, or secretly giving us drugs or brain implants which fool us into feeling free. We could always ask if these things are happening and they would have to answer honestly, which would result in the freedom score dropping.

  8. ^

    We could maximize the mean or median instead, but other issues arise. For example, if someone with a low freedom score has a curable disease, an ASI might not be incentivized to cure them, as their death would raise the overall mean/median. Even in sums of individual differences, issues persist, such as aging causing a lower freedom score. But a mean or median might still end up being the best metric, given the right adjustments.

  9. ^

    Normalizing for changes in population would mean that the freedom score sum would need to adjust proportionally (i.e. 50% more humans means the total score needs to be 50% higher to remain at parity). This means that if newly born humans average a lower score than the starting population, the new total score would drop below parity (and vice versa), so the ASI would be incentivized to encourage the birth of humans expected to have an above-average freedom score. This effect would be diminished by a combination of age-weighting and a discount rate, since it would take a while before new humans carry enough weight for their scores to matter, and that contribution would be discounted to the present.