AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda.

# Vanessa Kosoy

So we have this nice picture, where rationality is characterized by non-exploitability wrt a specific class of potential exploiters.

I’m not convinced this is the right desideratum for that purpose. Why should we care about exploitability by traders if making such trades is not actually possible given the environment and the utility function? IMO epistemic rationality is subservient to instrumental rationality, so our desiderata should be derived from the latter.

Human value-uncertainty is not particularly well-captured by Bayesian uncertainty, as I imagine you’ll agree… It’s hard to picture that I have some true platonic utility function.

Actually I am rather skeptical/agnostic on this. For me it’s fairly easy to picture that I have a “platonic” utility function, except that the time discount is dynamically inconsistent (not exponential).

I am in favor of exploring models of preferences which admit all sorts of uncertainty and/or dynamic inconsistency, but (i) it’s up for debate how many degrees of freedom we need to allow there and (ii) I feel that the case that logical induction is the right framework for this is kinda weak (but maybe I’m missing something).

I guess we can try studying Troll Bridge using infra-Bayesian modal logic, but atm I don’t know what would result.

From a radical-probabilist perspective, the complaint would be that Turing RL still uses the InfraBayesian update rule, which might not always be necessary to be rational (the same way Bayesian updates aren’t always necessary).

Ah, but there is a sense in which it doesn’t. The radical update rule is equivalent to updating on “secret evidence”. And in TRL we have such secret evidence. Namely, if we only look at the agent’s beliefs about “physics” (the environment), then they would be updated radically, because of secret evidence from “mathematics” (computations).

What if the super intelligent deity is less than maximally evil or maximally good? (E.g. the deity picking the median-performance world)

Thinking of the worst case is just a mathematical reflection of the fact that we want to be able to prove *lower bounds* on the expected utility of our agents. We have an unpublished theorem that, in some sense, *any* such lower bound guarantee has an infra-Bayesian formulation.

Another way to justify it is the infra-Bayesian CCT (see “Complete Class Theorem Weak Version” here).

What about the dutch-bookability of infraBayesians? (the classical dutch-book arguments seem to suggest pretty strongly that non-classical-Bayesians can be arbitrarily exploited for resources)

I think it might depend on the specific Dutch book argument, but *one* way infra-Bayesians escape them is by… being equivalent to certain Bayesians! For example, consider the setting where your agent has access to random bits that the environment can’t predict. Then, infra-Bayesian behavior is just the Nash equilibrium in a two-player zero-sum game (against Murphy). Now, the Nash strategy in such a game is the (Bayes) optimal response to the Nash strategy of the other player, so it can be regarded as “Bayesian”. However, the converse is false: not every best response to Nash is itself Nash. So, the infra-Bayesian decision rule is more restrictive than the corresponding Bayesian decision rule, but it’s a special case of the latter.

Is there a meaningful metaphysical interpretation of infra-Bayesianism that does not involve Murphy? (similarly to how Bayesianism can be metaphysically viewed as “there’s a real, static world out there, but I’m probabilistically unsure about it”)
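As a toy illustration of the zero-sum-game view above, one can compute the maximin (Nash) strategy of a small matrix game against “Murphy” with one linear program; this is a minimal sketch using scipy (matching pennies as the example game), not anything from the infra-Bayesian formalism itself:

```python
import numpy as np
from scipy.optimize import linprog

# Row player's payoff matrix for a 2x2 zero-sum game (matching pennies).
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

m, n = A.shape
# Variables: x (row strategy, length m) and v (game value).
# Maximize v  <=>  minimize -v, subject to (A^T x)_j >= v for every column j,
# sum(x) = 1, x >= 0.
c = np.concatenate([np.zeros(m), [-1.0]])
A_ub = np.hstack([-A.T, np.ones((n, 1))])   # v - (A^T x)_j <= 0
b_ub = np.zeros(n)
A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
b_eq = np.array([1.0])
bounds = [(0, None)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:m], res.x[m]
print("maximin strategy:", x.round(3), "value:", round(v, 3))
```

The resulting maximin strategy is exactly the strategy to which the Bayes-optimal-response characterization in the comment applies.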

I think of it as just another way of organizing uncertainty. The question is too broad for a succinct answer, I think, but here’s *one* POV you could take: let’s remember the frequentist definition of probability distributions as time limits of frequencies. Now, what if the time limit doesn’t converge? Then we can make a (crisp) infradistribution instead: the convex hull of all limit points. Classical frequentism also has the problem that the exact same event never repeats itself. But in “infra-frequentism” we can solve this: you don’t need the exact same event to repeat, you can draw the boundary around what counts as “the event” any way you like.

Once we go from passive observation to active interaction with the environment, your own behavior serves as *another* source of Knightian uncertainty. That is, you’re modeling the world in terms of certain features while ignoring everything else, but the state of everything else depends on your past behavior (and you don’t want to explicitly keep track of that). This line of thought can be formalized in the language of infra-MDPs (unpublished). And then ofc you complement this “aleatoric” uncertainty with “epistemic” uncertainty by considering the mixture of many infra-Bayesian hypotheses.
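The non-convergent-frequency picture is easy to simulate. Here is a toy bit sequence of my own choosing (blocks of 1s and 0s with doubling lengths) whose running frequency keeps oscillating between roughly 1/3 and 2/3 forever, so the “infra-frequentist” object would be the whole interval of limit points rather than a single number:

```python
import numpy as np

# Bit sequence with blocks of 1s and 0s of lengths 1, 2, 4, 8, ...
# The running frequency of 1s oscillates and never converges.
bits = []
val, k = 1, 0
while len(bits) < 1 << 16:
    bits.extend([val] * (1 << k))
    val ^= 1
    k += 1
bits = np.array(bits[: 1 << 16])

# Running frequency of 1s after each prefix.
freq = np.cumsum(bits) / np.arange(1, len(bits) + 1)
tail = freq[1000:]  # ignore the early transient
print("running-frequency range on the tail:",
      round(float(tail.min()), 3), "to", round(float(tail.max()), 3))
```

The set of limit points here is (approximately) the interval [1/3, 2/3]; its convex hull is the crisp infradistribution the comment describes.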

There is a formal sense in which “predicting Nirvana in some circumstance is equivalent to predicting that there are no possible futures in that circumstance”, see our latest post. It’s similar to MUDT, where, if you prove a contradiction then you can prove utility is as high as you like.

The exact same thing is true for classical probability theory: you have distributions, mixtures of distributions and linear functionals respectively. So I’m not sure what new difficulty comes from infra-Bayesianism?

Maybe it would help thinking about infra-MDPs and infra-POMDPs?

Also, here I wrote about how you could construct an infra-Bayesian version of the Solomonoff prior, although possibly it’s better to do it using infra-Bayesian logic.

I only skimmed this post for now, but a few quick comments on links to infra-Bayesianism:

InfraBayes doesn’t seem to have that worry, since it applies to non-realizable cases. (Or does it? Is there some kind of non-oscillation guarantee? Or is non-oscillation part of what it means for a set of environments to be learnable—IE it can oscillate in some cases?)… AFAIK the conditions for learnability in the InfraBayes case are still pretty wide open.

It’s true that these questions still need work, but I think it’s rather clear that something like “there are no traps” is a sufficient condition for learnability. For example, if you have a finite set of “episodic” hypotheses (i.e. time is divided into episodes, and no state is preserved from one episode to another), then a simple adversarial bandit algorithm (e.g. Exp3) that treats the hypotheses as arms leads to learning. For a more sophisticated example, consider Tian et al, which is formulated in the language of game theory but can be regarded as an infra-Bayesian regret bound for infra-MDPs.
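A minimal sketch of the hypotheses-as-arms idea mentioned above, with a standard Exp3 implementation and two toy “hypotheses” (the Bernoulli reward processes are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3(reward_fns, T, gamma=0.1):
    """Exp3 adversarial bandit: each 'arm' is an episodic hypothesis/policy."""
    K = len(reward_fns)
    weights = np.ones(K)
    total = 0.0
    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        r = reward_fns[arm](t)  # reward in [0, 1] for the pulled arm
        total += r
        # Importance-weighted reward estimate; update only the pulled arm.
        weights[arm] *= np.exp(gamma * (r / probs[arm]) / K)
    return total

# Two toy "hypotheses": Bernoulli rewards with means 0.8 and 0.2.
arms = [lambda t: float(rng.random() < 0.8),
        lambda t: float(rng.random() < 0.2)]
T = 2000
avg = exp3(arms, T) / T
print("average reward:", avg)  # concentrates near the better arm's mean
```

With more arms and episodic structure this is exactly the “treat hypotheses as arms” learner: regret against the best hypothesis grows sublinearly even if rewards are chosen adversarially.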

Radical Probabilism and InfraBayes are plausibly two orthogonal dimensions of generalization for rationality. Ultimately we want to generalize in both directions, but to do that, working out the radical-probabilist (IE logical induction) decision theory in more detail might be necessary.

True, but IMO the way to incorporate “radical probabilism” is via what I called Turing RL.

I don’t know how to talk about the CDT vs EDT insight in the InfraBayes world.

I’m not sure what precisely you mean by “CDT vs EDT insight”, but our latest post might be relevant: it shows how you can regard infra-Bayesian hypotheses as joint beliefs about observations *and* actions, EDT-style.

Perhaps more importantly, the Troll Bridge insights. As I mentioned in the beginning, in order to meaningfully solve Troll Bridge, it’s necessary to “respect logic” in the right sense. InfraBayes doesn’t do this, and it’s not clear how to get it to do so.

Is there a way to operationalize “respecting logic”? For example, a specific toy scenario where an infra-Bayesian agent would fail due to not respecting logic?

From your reply to Paul, I understand your argument to be something like the following:

Any solution to single-single alignment will involve a tradeoff between alignment and capability.

If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.

If AI systems *are* designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.

Given the technical knowledge to design cooperative AI, the incentives favor cooperative AI, since cooperative AIs can come out ahead by striking mutually beneficial deals even purely in terms of capability. Therefore, producing such technical knowledge will prevent catastrophe.

We might still need regulation to prevent players who irrationally choose to deploy uncooperative AI, but this kind of regulation is relatively easy to promote since it aligns with competitive incentives (an uncooperative AI wouldn’t have much of an edge, it would just threaten to drag everyone into a mutually destructive strategy).

I think this argument has merit, but also the following weakness: given single-single alignment, we can delegate the design of cooperative AI to the initial uncooperative AI. Moreover, uncooperative AIs have an incentive to self-modify into cooperative AIs, if they assign even a small probability to their peers doing the same. I think we definitely need more research to understand these questions better, but it seems plausible we can reduce cooperation to “just” solving single-single alignment.

I’m kind of scared of this approach because I feel that unless you really nail everything, there is going to be a gap that an attacker can exploit.

I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both and in a similar way.

More generally, I guess I’m more optimistic than you about solving all such philosophical liabilities.

I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior.

I don’t understand the proposal. Is there a link I should read?

This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with “giant” universes that do all the possible computations you would want, and then using the “free” complexity in the bridge rules to pick which of the computations you actually wanted.

So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of filtering the outputs of only one program. But the bridge rule is not “free complexity” because it’s not coming from a simplicity prior at all. For a program of length , you need a particular DFA of size . However, the actual DFA is of expected size with . The probability of having the DFA you need embedded in that is something like . So moving everything to the bridge makes a much less likely hypothesis.

I don’t understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual “unipolar” sense. These management assistants, DAOs etc. *are not aligned to the goals of their respective, individual users/owners*.

I do see two reasons why multipolar scenarios might require more technical research:

Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.

In a multipolar scenario, aligned AI might have to compete with already deployed unaligned AI, meaning that safety must not come at the expense of capability

^{[1]}.

In addition, aligning a single AI to multiple users also requires extra technical research (we need to somehow balance the goals of the different users and solve the associated mechanism design problem.)

However, it seems that this article is arguing for something different, since none of the above aspects are highlighted in the description of the scenarios. So, I’m confused.

In fact, I suspect this desideratum is impossible in its strictest form, and we actually have no choice but to somehow make sure aligned AIs have a significant head start on all unaligned AIs. ↩︎

Is bounded? I assign significant probability to it being or more, as mentioned in the other thread between me and Michael Cohen, in which case we’d have trouble.

Yes, you’re right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the “bridge rules” by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here’s a sketch of a proposal for how to solve this. Let’s construct our prior to be the *convolution* of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that’s sampled as follows:

- First, sample a hypothesis from the Solomonoff prior
- Second, choose a number according to some simple distribution with high expected value (e.g. ) with
- Third, sample a DFA with states and a uniformly random transition table
- Fourth, apply to the output of

We think of the simplicity prior as choosing “physics” (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing “bridge rules” (which we expect to have low computational complexity but possibly high description complexity). Ofc this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of , however the source of our trouble is also “merely” a factor of .

Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway. I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the limit).
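The sampling procedure above can be sketched in code. The transducer-style DFA below (uniformly random transition table plus a per-state output bit) is my own assumption for illustration, since the exact formalization is unpublished:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dfa(n_states, alphabet=2):
    """Uniformly random transition table; each state also emits a bit.
    (The per-state emission is an assumed detail, for illustration only.)"""
    delta = rng.integers(0, n_states, size=(n_states, alphabet))
    emit = rng.integers(0, 2, size=n_states)
    return delta, emit

def apply_dfa(dfa, bits):
    """Run the DFA over the 'physics' output, producing the observed stream."""
    delta, emit = dfa
    state, out = 0, []
    for b in bits:
        state = delta[state, b]
        out.append(int(emit[state]))
    return out

# Stand-in for the output of the Solomonoff-sampled program ("physics").
physics_output = list(rng.integers(0, 2, size=20))
dfa = sample_dfa(n_states=8)   # the "bridge rules"
observed = apply_dfa(dfa, physics_output)
print(observed)
```

The point of the construction is that the DFA’s description length is paid out of a separate, large “easiness” budget, so a simulation hypothesis gains nothing by smuggling complexity into the bridge rules.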


So is the general idea that we quantilize such that we’re choosing in expectation an action that doesn’t have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small?

Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is and your quantilization fraction is then the AI’s probability of corruption is bounded by .

I also wonder if using the user policy to sample actions isn’t limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?

Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn’t specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
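Here is a toy sketch of the quantilization strategy and corruption bound discussed in this thread. The names `eps` (probability of corruption under the user policy) and `phi` (quantilization fraction) are assumed, since the post’s own notation did not survive extraction; the action space and utilities are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

def quantilize(sample_action, proxy_utility, phi, n=500):
    """Draw n actions from the user policy, keep the top phi-fraction by
    proxy utility, and pick one of those uniformly at random."""
    actions = sorted((sample_action() for _ in range(n)),
                     key=proxy_utility, reverse=True)
    top = actions[: max(1, int(phi * n))]
    return top[rng.integers(len(top))]

# Toy setup: with probability eps the user policy emits a "corrupt" action,
# which the (manipulable) proxy utility mistakenly rates highest.
eps, phi = 0.01, 0.2
CORRUPT = 10.0
sample_action = lambda: CORRUPT if rng.random() < eps else rng.random()
proxy_utility = lambda a: a

trials = 200
corrupt_rate = sum(quantilize(sample_action, proxy_utility, phi) == CORRUPT
                   for _ in range(trials)) / trials
print("empirical corruption rate:", corrupt_rate,
      "vs bound eps/phi =", eps / phi)
```

Even though the proxy utility actively favors the corrupt action, the quantilizer’s corruption probability stays bounded by `eps / phi`, because the corrupt actions can occupy only a small slice of the retained top fraction.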

More observations about this attack vector (“attack from counterfactuals”). I focus on “amplifying by subjective time”.

**The harder the takeoff, the more dangerous this attack vector:** During every simulation cycle, the ability to defend against simulated malign AI depends on the power of the defense system *at the beginning of the cycle*^{[1]}. On the other hand, the capability of the attacker depends on its power *at the end of the cycle*. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier.

**Inner control of the anchor makes the system safer:** Given a fixed objective-time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective-time anchor forward in time, in order to benefit from improvements in the defense system.

**Additional information about the external world makes the system safer:** Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 63%. However, this is only the case if attacks in different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not, the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence.

More precisely, it is somewhat better than this since, if we are at some late cycle among a sequence of cycles anchored to the same point in objective time, the simulated user can use knowledge generated in previous cycles to improve the defense system during the current cycle. So, early cycles contribute more to the risk. ↩︎
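The compounding arithmetic for the independent-attacks case (0.1% per cycle, 1000 cycles) can be checked directly:

```python
# Independent attacks: failure probability compounds across cycles.
p_cycle = 0.001          # per-cycle attack probability (0.1%)
cycles = 1000
p_fail_independent = 1 - (1 - p_cycle) ** cycles
print(f"independent cycles: {p_fail_independent:.1%}")

# Perfectly correlated case (the AI knows the outcome in advance):
# the 1000 cycles behave as one event, so no compounding.
p_fail_correlated = p_cycle
print(f"fully correlated:   {p_fail_correlated:.1%}")
```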

Virtually all the credit for this post goes to Alex, I think the proof of Proposition 1 was more or less my only contribution.

The distribution is the user’s policy, and the utility function for this purpose is the *eventual success probability* estimated by the user (as part of the timeline report) at the end of the “maneuver”. More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it; for example, I did it for MDPs.

Yes, I *think* we are talking about the same thing. If you change your distribution over hypotheses, or the distribution over evidence implied by each hypothesis, then it means you’re changing the prior.

IIUC your question can be reformulated as follows: a crisp infradistribution can be regarded as a claim about reality (the true distribution is inside the set), but it’s not clear how to generalize this to non-crisp. Well, if you think in terms of desiderata, then crisp says: if the distribution is inside the set then we have some lower bound on expected utility (and if it’s not, then we don’t promise anything). On the other hand, non-crisp gives a lower bound that is *variable* with the true distribution. We can think of non-crisp infradistributions as being *fuzzy* properties of the distribution (hence the name “crisp”). In fact, if we restrict ourselves to any of homogeneous, cohomogeneous or c-additive infradistributions, then we actually have a formal way to assign membership functions to infradistributions, i.e. literally regard them as fuzzy sets of distributions (which ofc have to satisfy some property analogous to convexity).

An alternative explanation of will-power is hyperbolic discounting. Your time discount function is not exponential, and therefore not dynamically consistent. So you can simultaneously (i) prefer gaining short-term pleasure at the expense of long-term goals (e.g. play games instead of studying) and (ii) take actions to prevent future-you from doing the same (e.g. go to rehab).
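A quick numerical illustration of the preference reversal that hyperbolic discounting produces (rewards, delays, and the discount rate are arbitrary toy numbers):

```python
def hyperbolic_value(reward, delay, k=1.0):
    """Hyperbolic discounting: value falls off as 1/(1 + k*delay)."""
    return reward / (1.0 + k * delay)

small, t_small = 5.0, 1.0     # short-term pleasure (play games)
large, t_large = 8.0, 6.0     # long-term payoff (study)

# Evaluated now: the short-term option looks better.
prefer_small_now = (hyperbolic_value(small, t_small)
                    > hyperbolic_value(large, t_large))

# Evaluated 10 steps in advance: the long-term option looks better, so
# present-you endorses committing future-you (e.g. going to rehab).
d = 10.0
prefer_large_ahead = (hyperbolic_value(large, t_large + d)
                      > hyperbolic_value(small, t_small + d))

print(prefer_small_now, prefer_large_ahead)  # both True: a preference reversal
```

With exponential discounting, by contrast, the ranking of two fixed options never flips as both delays are shifted by the same amount, which is exactly the dynamic-consistency property hyperbolic discounting lacks.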

This seems simpler, but it doesn’t explain why the same drugs that cause/prevent weird beliefs should add/deplete will-power.

“How to weight evidence vs. the prior” is not a free parameter in Bayesianism. What you *can* have is some parameter controlling the prior itself (so that the prior can be less or more confident about certain things). I guess we can speculate that there are some parameters in the prior and some parameters in the reward function s.t. various drugs affect both of them simultaneously, and maybe there’s a planning-as-inference explanation for why the two are entangled.

There is some truth in that, in the sense that your beliefs must take a form that is *learnable* rather than just a god-given system of logical relationships.

Am I right though that in the case of e.g. Newcomb’s problem, if you use the anti-Nirvana trick (getting -infinity reward if the prediction is wrong), then you would still recover the same behavior (EDIT: if you also use best-case reasoning instead of worst-case reasoning)?

Yes

imagine that you know that the even bits in an infinite bitsequence come from a fair coin, but the odd bits come from some other agent, where you can’t model them exactly but you have some suspicion that they are a bit more likely to choose 1 over 0. Risk aversion might involve making a small bet that you’d see a 1 rather than a 0 in some specific odd bit (smaller than what EU maximization / Bayesian decision theory would recommend), but “reflecting reality” might recommend having Knightian uncertainty about the output of the agent which would mean never making a bet on the outputs of the odd bits.

I think that if you are offered a single bet, your utility is linear in money and your belief is a *crisp* infradistribution (i.e. a closed convex set of probability distributions), then it is always optimal to bet either as much as you can or nothing at all. But for more general infradistributions this need not be the case. For example, consider and take the set of a-measures generated by and . Suppose you start with dollars and can bet any amount on any outcome at even odds. Then the optimal bet is betting dollars on the outcome , with a value of dollars.
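For the crisp case, the all-or-nothing claim is easy to check numerically. With a closed interval of heads-probabilities and an even-odds bet, the worst-case expected wealth is piecewise linear in the bet size, so the optimum sits at a corner (the intervals and stakes below are toy numbers of my own, not the post’s elided example):

```python
import numpy as np

def worst_case_value(bet, p_endpoints, wealth=1.0):
    """Expected wealth after an even-odds bet on heads, minimized over the
    crisp set of heads-probabilities. Expectation is linear in p, so it
    suffices to check the interval's endpoints."""
    return min(wealth + bet * (2 * p - 1) for p in p_endpoints)

bets = np.linspace(-1.0, 1.0, 201)   # bet sizes up to the full wealth

results = {}
for lo, hi in [(0.4, 0.7), (0.6, 0.7)]:
    values = [worst_case_value(b, (lo, hi)) for b in bets]
    results[(lo, hi)] = float(bets[int(np.argmax(values))])

# All-or-nothing: bet 0 when the set straddles 1/2, bet everything when
# every distribution in the set favors one outcome.
print(results)
```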

IIUC, here’s a simple way to test this hypothesis: initialize a random neural network, and then find the minimal-loss point in the tangent space. Since the tangent space is linear, this is easy to do (i.e. doesn’t require heuristic gradient descent): for square loss it’s just solving a large linear system once; for many other losses it should amount to convex optimization, for which we have provably efficient algorithms. And, I guess it’s underdetermined, so you add some regularization. Is the result about as good as normal gradient descent in the actual parameter space? I’m guessing some of the linked papers might have done something like this?
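A minimal numpy sketch of the proposed test on a tiny network (architecture, data, and regularization strength are all arbitrary choices): linearize the network at initialization via a numerical Jacobian, then obtain the tangent-space square-loss minimum from a single regularized linear solve rather than gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny one-hidden-layer net; parameters packed into one flat vector.
def unpack(theta, d_in=2, d_h=16):
    W1 = theta[: d_in * d_h].reshape(d_h, d_in)
    w2 = theta[d_in * d_h:]
    return W1, w2

def net(theta, X):
    W1, w2 = unpack(theta)
    return np.tanh(X @ W1.T) @ w2

# Toy regression data.
X = rng.normal(size=(64, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]

theta0 = rng.normal(size=2 * 16 + 16) * 0.5

# Numerical Jacobian of the network outputs w.r.t. parameters at init.
eps = 1e-5
J = np.stack([(net(theta0 + eps * e, X) - net(theta0 - eps * e, X)) / (2 * eps)
              for e in np.eye(len(theta0))], axis=1)

# In the tangent space the model is linear in delta = theta - theta0, so the
# (ridge-regularized) square-loss minimum is one linear solve.
resid = y - net(theta0, X)
lam = 1e-3
delta = np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), J.T @ resid)

pred = net(theta0, X) + J @ delta
mse = float(np.mean((pred - y) ** 2))
print("tangent-space fit MSE:", mse)
```

Comparing this `mse` against what gradient descent on the actual parameters achieves is exactly the experiment the comment proposes; at larger widths the two are expected to approach each other (the NTK regime).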