AI alignment researcher supported by MIRI and LTFF. Working on the learning-theoretic agenda. Based in Israel. See also LinkedIn.
E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org
My model is that the concept of “morality” is a fiction which has 4 generators that are real:
People have empathy, which means they intrinsically care about other people (and sufficiently person-like entities), but mostly about those in their social vicinity. Also, different people have different strengths of empathy, and a minority might have virtually none.
Superrational cooperation is something that people understand intuitively to some degree. Obviously, a minority of people understand it on System 2 level as well.
There is something virtue-ethics-like which I find in my own preferences, along the lines of “some things I would prefer not to do, not because of their consequences, but because I don’t want to be the kind of person who would do that”. However, I expect different people to differ in this regard.
The cultural standards of morality, which it might be selfishly beneficial to go along with, including lying to yourself that you’re doing it for non-selfish reasons. Which, as you say, becomes irrelevant once you secure enough power. This is a sort of self-deception which people are intuitively skilled at.
Is it possible to replace the maximin decision rule in infra-Bayesianism with a different decision rule? One surprisingly strong desideratum for such decision rules is the learnability of some natural hypothesis classes.
In the following, all infradistributions are crisp.
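Fix finite action set $A$ and finite observation set $O$. For any $k \in \mathbb{N}$ and $\gamma \in (0,1)$, let
$$M^k_\gamma : (A \times O)^\omega \to \Delta (A \times O)^k$$
be defined by
$$M^k_\gamma(h \mid d) := (1-\gamma) \sum_{n=0}^{\infty} \gamma^n \, \mathbb{1}\left[h = d_{n:n+k}\right]$$
In other words, this kernel samples a time step $n$ out of the geometric distribution with parameter $\gamma$, and then produces the sequence of length $k$ that appears in the destiny $d$ starting at $n$.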
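To make the sampling description concrete, here is a minimal Python sketch of such a kernel. It assumes that time steps are indexed from 0, that step $n$ gets weight $(1-\gamma)\gamma^n$, and that the destiny is given as a function from time steps to action-observation pairs. The names `window_distribution`, `horizon` and `alternating` are hypothetical, and the infinite sum is truncated, so the function also reports the mass it leaves out.

```python
from collections import defaultdict


def window_distribution(destiny, k, gamma, horizon):
    """Approximate the distribution over length-k windows of a destiny.

    destiny: callable mapping a time step n (0-based, an assumption) to a symbol.
    The start step n gets weight (1 - gamma) * gamma**n; the sum is truncated
    at `horizon`, and the omitted tail mass gamma**horizon is returned too.
    """
    dist = defaultdict(float)
    for n in range(horizon):
        window = tuple(destiny(n + i) for i in range(k))
        dist[window] += (1 - gamma) * gamma ** n
    return dict(dist), gamma ** horizon


def alternating(n):
    """Toy destiny: a fixed action "a" with observations 0, 1, 0, 1, ..."""
    return ("a", n % 2)


dist, tail = window_distribution(alternating, k=2, gamma=0.5, horizon=60)
```

For this toy destiny only two windows ever occur, and the window starting at an even step carries mass $1/(1+\gamma)$, which is why values of $\gamma$ close to 1 weight the two windows almost equally.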
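For any continuous[1] function $D : \square (A \times O)^k \to \mathbb{R}$, we get a decision rule. Namely, this rule says that, given infra-Bayesian law $\Lambda$ and discount parameter $\gamma$, the optimal policy is
$$\pi^*_{D\Lambda} := \arg\max_{\pi : O^* \to A} D\left((M^k_\gamma)_* \Lambda(\pi)\right)$$
The usual maximin is recovered when we have some reward function $r : (A \times O)^k \to \mathbb{R}$ and the corresponding $D_r$ is
$$D_r(\Theta) := \min_{\theta \in \Theta} \mathbb{E}_\theta[r]$$
Given a set $\mathcal{H}$ of laws, it is said to be learnable w.r.t. $D$ when there is a family of policies $\{\pi_\gamma\}_{\gamma \in (0,1)}$ such that for any $\Lambda \in \mathcal{H}$
$$\lim_{\gamma \to 1} \left( \max_{\pi} D\left((M^k_\gamma)_* \Lambda(\pi)\right) - D\left((M^k_\gamma)_* \Lambda(\pi_\gamma)\right) \right) = 0$$
For $D_r$ we know that e.g. the set of all communicating[2] finite infra-RDPs is learnable. More generally, for any $t \in [0,1]$ we have the learnable decision rule
$$D^t_r(\Theta) := t \max_{\theta \in \Theta} \mathbb{E}_\theta[r] + (1-t) \min_{\theta \in \Theta} \mathbb{E}_\theta[r]$$
This is the “mesomism” I talked about before.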
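As a rough illustration of how such decision rules evaluate a credal set, here is a small Python sketch. It assumes the credal set is represented by a finite list of its extreme points (each a dict from outcomes to probabilities), which is enough here because expected reward is linear, so its minimum and maximum over the convex hull are attained at extreme points. It also assumes that the interpolated rule has the form "t times best case plus (1 - t) times worst case". All names (`maximin_value`, `interpolated_value`, `credal`, `reward`) are hypothetical.

```python
def expected_reward(theta, r):
    """Expected value of reward function r under the distribution theta."""
    return sum(p * r(h) for h, p in theta.items())


def maximin_value(credal_set, r):
    """Worst-case expected reward over the credal set (the maximin criterion)."""
    return min(expected_reward(theta, r) for theta in credal_set)


def interpolated_value(credal_set, r, t):
    """Assumed interpolation: t * best case + (1 - t) * worst case, t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    values = [expected_reward(theta, r) for theta in credal_set]
    return t * max(values) + (1 - t) * min(values)


def reward(h):
    """Toy reward on length-1 histories: 1 for "good", 0 otherwise."""
    return 1.0 if h == "good" else 0.0


# Toy credal set given by two extreme points.
credal = [{"good": 0.2, "bad": 0.8}, {"good": 0.7, "bad": 0.3}]
worst_case = maximin_value(credal, reward)
halfway = interpolated_value(credal, reward, 0.5)
```

Setting `t = 0` recovers the maximin value, and `t = 1` gives the fully optimistic rule in which ambiguity is always resolved in the agent's favor.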
Also, any monotonically increasing $D$ seems to be learnable, i.e. any $D$ s.t. for $\Theta_1 \subseteq \Theta_2$ we have $D(\Theta_1) \leq D(\Theta_2)$. For such decision rules, you can essentially assume that “nature” (i.e. whatever resolves the ambiguity of the infradistributions) is collaborative with the agent. These rules are not very interesting.
On the other hand, decision rules of the form $D_{r_1} + D_{r_2}$ are not learnable in general, and neither are decision rules of the form $D_r + D'$ for $D'$ monotonically increasing.
Open Problem: Are there any learnable decision rules that are not mesomism or monotonically increasing?
A positive answer to the above would provide interesting generalizations of infra-Bayesianism. A negative answer to the above would provide an interesting novel justification of the maximin. Indeed, learnability is not a criterion that was ever used in axiomatic constructions of decision theory[3], AFAIK.
[1] We can try considering discontinuous functions as well, but it seems natural to start with continuous. If we want the optimal policy to exist, we usually need the function to be at least upper semicontinuous.
[2] There are weaker conditions than “communicating” that are sufficient, e.g. “resettable” (meaning that the agent can always force returning to the initial state), and some even weaker conditions that I will not spell out here.
[3] I mean theorems like VNM, Savage etc.
First, given nanotechnology, it might be possible to build colonies much faster.
Second, I think the best way to live is probably as uploads inside virtual reality, so terraforming is probably irrelevant.
Third, it’s sufficient that the colonists are uploaded or cryopreserved (via some superintelligence-vetted method) and stored someplace safe (whether on Earth or in space) until the colony is entirely ready.
Fourth, if we can stop aging and prevent other dangers (including unaligned AI), then a timeline of decades is fine.
I don’t know whether we live in a hard-takeoff singleton world or not. I think there is some evidence in that direction, e.g. from thinking about the kind of qualitative changes in AI algorithms that might come about in the future, and their implications on the capability growth curve, and also about the possibility of recursive self-improvement. But, the evidence is definitely far from conclusive (in any direction).
I think that the singleton world is definitely likely enough to merit some consideration. I also think that some of the same principles apply to some multipole worlds.
“Commit to not make anyone predictably regret supporting the project or not opposing it” is worrying only by omission: it’s a good guideline, but it leaves the door open for “punish anyone who failed to support the project once the project gets the power to do so”.
Yes, I never imagined doing such a thing, but I definitely agree it should be made clear. Basically, don’t make threats, i.e. don’t try to shape others’ incentives in ways that they would be better off precommitting not to go along with.
It’s not because they’re not on Earth, it’s because they have a superintelligence helping them. Which might give them advice and guidance, take care of their physical and mental health, create physical constraints (e.g. that prevent violence), or even give them mind augmentation like mako yass suggested (although I don’t think that’s likely to be a good idea early on). And I don’t expect their environment to be fragile because, again, designed by superintelligence. But I don’t know the details of the solution: the AI will decide those, as it will be much smarter than me.
I don’t have to know in advance that we’re in a hard-takeoff singleton world, or even that my AI will succeed in achieving those objectives. The only thing I absolutely have to know in advance is that my AI is aligned. What sort of evidence will I have for this? A lot of detailed mathematical theory, with the modeling assumptions validated by computational experiments and knowledge from other fields of science (e.g. physics, cognitive science, evolutionary biology).
I think you’re misinterpreting Yudkowsky’s quote. “Using the null string as input” doesn’t mean “without evidence”, it means “without other people telling me parts of the answer (to this particular question)”.
I’m not sure what is “extremely destructive and costly” in what I described? Unless you mean the risk of misalignment, in which case, see above.
I know, this is what I pointed at in footnote 1. Although “dumbest AI” is not quite right: the sort of AI MIRI envisions is still very superhuman in particular domains, but is somehow kept narrowly confined to acting within those domains (e.g. designing nanobots). The rationale mostly isn’t assuming that at that stage it won’t be possible to create a full superintelligence, but assuming that aligning such a restricted AI would be easier. I have different views on alignment, leading me to believe that aligning a full-fledged superintelligence (sovereign) is actually easier (via PSI or something in that vein). On this view, we still need to contend with the question of what we will (honestly!) tell other people that our AI is actually going to do. Hence, the above.
People like Andrew Critch and Paul Christiano have criticized MIRI in the past for their “pivotal act” strategy. The latter can be described as “build superintelligence and use it to take unilateral world-scale actions in a manner inconsistent with existing law and order” (e.g. the notorious “melt all GPUs” example). The critics say (justifiably IMO) that this strategy looks pretty hostile to many actors, can trigger preemptive actions against the project attempting it, and generally fosters mistrust.
Is there a good alternative? The critics tend to assume slow-takeoff multipole scenarios, which makes the comparison with their preferred solutions somewhat “apples and oranges”. Suppose that we do live in a hard-takeoff singleton world, what then? One answer is “create a trustworthy, competent, multinational megaproject”. Alright, but suppose you can’t create a multinational megaproject, but you can build aligned AI unilaterally. What is a relatively cooperative thing you can do which would still be effective?
Here is my proposed rough sketch of such a plan[1]:
Commit to not make anyone predictably regret supporting the project or not opposing it. This rule is the most important and the one I’m the most confident of by far. In an ideal world, it should be more-or-less sufficient in itself. But in the real world, it might still be useful to provide more tangible details, which the next items try to do.
Within the bounds of Earth, commit to obey international law, and local law at least inasmuch as the latter is consistent with international law, with only two possible exceptions (see below). Notably, this allows for actions such as (i) distributing technology that cures diseases, reverses aging, produces cheap food etc. (ii) lobbying for societal improvements (but see the superpersuasion clause below).
Exception 1: You can violate any law if it’s absolutely necessary to prevent a catastrophe on a scale comparable with a nuclear war or worse, but only to the extent it’s necessary for that purpose. (e.g. if a lab is about to build unaligned AI that would kill millions of people and it’s not possible to persuade them to stop or convince the authorities to act in a timely manner, you can sabotage it.)[2]
Build space colonies. These space colonies will host utopic societies and most people on Earth are invited to immigrate there.
Exception 2: A person held in captivity in a manner legal according to local law, who faces the death penalty or is treated in a manner violating accepted international rules about the treatment of prisoners, might be given the option to leave to the colonies. If they exercise this option, their original jurisdiction is permitted to exile them from Earth permanently and/or bar them from any interaction with Earth that can plausibly enable activities illegal according to that jurisdiction[3].
Commit to adequately compensate any economy hurt by emigration to the colonies or other disruption by you. For example, if space emigration causes the loss of valuable labor, you can send robots to supplant it.
Commit to not directly intervene in international conflicts or upset the balance of powers by supplying military tech to any side, except in cases when it is absolutely necessary to prevent massive violations of international law and human rights.
Commit to only use superhuman persuasion when arguing towards a valid conclusion via valid arguments, in a manner that doesn’t go against the interests of the person being persuaded.
[1] Importantly, this makes stronger assumptions about the kind of AI you can align than MIRI-style pivotal acts. Essentially, it assumes that you can directly or indirectly ask the AI to find good plans consistent with the commitments below, rather than directing it to do something much more specific. Otherwise, it is hard to use Exception 1 (see below) gracefully.
[2] A more conservative alternative is to limit Exception 1 to catastrophes that would spill over to the space colonies (see next item).
[3] It might be sensible to consider a more conservative version which doesn’t have Exception 2, even though the implications are unpleasant.
Sure, if after updating on your discovery, it seems that the current trajectory is not doomed, it might imply accelerating is good. But, here it is very far from being the case.
I missed that paragraph on first reading, mea culpa. I think that your story about how it’s a win for interpretability and alignment is very unconvincing, but I don’t feel like hashing it out atm. Revised to weak downvote.
Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?
Because it’s capability research. It shortens the TAI timeline with little compensating benefit.
Downvoted because conditional on this being true, it is harmful to publish. Don’t take it personally, but this is content I don’t want to see on LW.
Intuitively, it feels that there is something special about mathematical knowledge from a learning-theoretic perspective. Mathematics seems infinitely rich: no matter how much we learn, there is always more interesting structure to be discovered. Impossibility results like the halting problem and Gödel incompleteness lend some credence to this intuition, but are insufficient to fully formalize it.
Here is my proposal for how to formulate a theorem that would make this idea rigorous.
Fix some natural hypothesis class for mathematical knowledge, such as some variety of tree automata. Each such hypothesis $\Theta$ represents an infradistribution over $\Gamma$: the “space of counterpossible computational universes”. We can say that $\Theta$ is a “true hypothesis” when there is some $\theta$ in the credal set $\Theta$ (a distribution over $\Gamma$) s.t. the ground truth $\Upsilon^* \in \Gamma$ “looks” as if it’s sampled from $\theta$. The latter should be formalizable via something like a computationally bounded version of Martin-Löf randomness.
We can now try to say that $\Upsilon^*$ is “rich” if for any true hypothesis $\Theta$, there is a refinement $\Xi \subseteq \Theta$ which is also a true hypothesis and “knows” at least one bit of information that $\Theta$ doesn’t, in some sense. This is clearly true, since there can be no automaton or even any computable hypothesis which fully describes $\Upsilon^*$. But, it’s also completely boring: the required $\Xi$ can be constructed by “hardcoding” an additional fact into $\Theta$. This doesn’t look like “discovering interesting structure”, but rather just like brute-force memorization.
What if instead we require that $\Xi$ knows infinitely many bits of information that $\Theta$ doesn’t? This is already more interesting. Imagine that instead of metacognition / mathematics, we would be talking about ordinary sequence prediction. In this case it is indeed an interesting non-trivial condition that the sequence contains infinitely many regularities, s.t. each of them can be expressed by a finite automaton but their conjunction cannot. For example, maybe the $n$-th bit in the sequence depends only on the largest $k$ s.t. $2^k$ divides $n$, but the dependence on $k$ is already uncomputable (or at least inexpressible by a finite automaton).
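The sequence-prediction example can be made concrete with a short Python sketch. It assumes the quantity in question is the largest $k$ such that $2^k$ divides $n$ (the 2-adic valuation of $n$), and it uses a deliberately simple, computable stand-in `toy_rule` for the dependence on $k$, which in the text is imagined to be uncomputable. All function names are hypothetical.

```python
def two_adic_valuation(n):
    """Largest k such that 2**k divides n, for a positive integer n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return k


def sequence_bit(n, rule):
    """n-th bit of the sequence: it depends on n only through its valuation."""
    return rule(two_adic_valuation(n))


def toy_rule(k):
    """Computable stand-in for the (possibly uncomputable) dependence on k."""
    return k % 2


valuations = [two_adic_valuation(n) for n in range(1, 9)]
bits = [sequence_bit(n, toy_rule) for n in range(1, 9)]
```

Each individual regularity is simple: for a fixed $K$, the bits at positions not divisible by $2^K$ repeat with period $2^K$ whatever the rule is, so a finite automaton can capture them, while no single automaton captures all $K$ at once if the rule itself is not finitely expressible.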
However, for our original application, this is entirely insufficient. This is because the formal language we use to define $\Gamma$ (e.g. combinator calculus) has some “easy” equivalence relations. For example, consider the family of programs of the form “if 2+2=4 then output 0, otherwise...”. All of those programs would output 0, which is obvious once you know that 2+2=4. Therefore, once your automaton is able to check some such easy equivalence relations, hardcoding a single new fact (in the example, 2+2=4) generates infinitely many “new” bits of information. Once again, we are left with brute-force memorization.
Here’s the improved condition: For any true hypothesis $\Theta$, there is a true refinement $\Xi \subseteq \Theta$ s.t. conditioning $\Theta$ on any finite set of observations cannot produce a refinement of $\Xi$.
There is a technicality here, because we’re talking about infradistributions, so what is “conditioning” exactly? For credal sets, I think it is sufficient to allow two types of “conditioning”:
For any given observation $A$ and $p \in (0,1]$, we can form $\{\theta \in \Theta \mid \theta(A) \geq p\}$.
For any given observation $A$ s.t. $\min_{\theta \in \Theta} \theta(A) > 0$, we can form $\{(\theta \mid A) \mid \theta \in \Theta\}$.
This rules out the counterexample from before: the easy equivalence relation can be represented inside $\Theta$, and then the entire sequence of “novel” bits can be generated by a conditioning.
Alright, so does $\Upsilon^*$ actually satisfy this condition? I think it’s very probable, but I haven’t proved it yet.
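For intuition about the two kinds of “conditioning” on a credal set, here is a hedged Python sketch. It assumes that the first operation keeps only the distributions assigning probability at least $p$ to an observation, and that the second applies ordinary Bayesian conditioning to every member, which requires every member to give the observation positive probability. It also treats the credal set as a plain finite list of distributions and ignores convex closure, which is a real simplification. All names are hypothetical.

```python
def prob(theta, event):
    """Probability that the distribution theta (a dict) assigns to an event (a set)."""
    return sum(p for x, p in theta.items() if x in event)


def restrict_by_threshold(credal_set, event, p):
    """Keep only the distributions that give the event probability at least p."""
    return [theta for theta in credal_set if prob(theta, event) >= p]


def condition_pointwise(credal_set, event):
    """Bayes-condition every distribution on the event.

    Only defined when every member gives the event positive probability.
    """
    masses = [prob(theta, event) for theta in credal_set]
    if not masses or min(masses) <= 0:
        raise ValueError("every distribution must give the event positive probability")
    return [
        {x: q / m for x, q in theta.items() if x in event}
        for theta, m in zip(credal_set, masses)
    ]


# Toy credal set over outcomes "a", "b", "c" (finite list, no convex closure).
credal = [{"a": 0.5, "b": 0.5, "c": 0.0}, {"a": 0.1, "b": 0.3, "c": 0.6}]
event = {"a", "b"}
kept = restrict_by_threshold(credal, event, 0.5)
conditioned = condition_pointwise(credal, event)
```

In both cases the result is again a set of distributions, so the two operations can be chained over a finite sequence of observations, as the condition in the text requires.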
Linkpost to Twitter thread is a bad format for LessWrong. Not everyone has Twitter.
I agree that in the long term it probably matters little. However, I find the issue interesting, because the failure of reasoning that leads people to ignore the possibility of AI personhood seems similar to the failure of reasoning that leads people to ignore existential risks from AI. In both cases it “sounds like scifi” or “it’s just software”. It is possible that raising awareness of the personhood issue is politically beneficial for addressing X-risk as well. (And, it would sure be nice to avoid making the world worse in the interim.)
What is ? Also, we should allow adding some valid reward function of .
is a polytope with , corresponding to allowed action distributions at that state.
I think it’s mathematically cleaner to get rid of A and have those be abstract polytopes.
Did anyone around here try Relationship Hero and has opinions?
The solution is here. In a nutshell, naive MWI is wrong, not all Everett branches coexist, but a lot of Everett branches do coexist s.t. with high probability all of them display expected frequencies.