Wei Dai
I put the full report here so you don’t have to wait for them to email it to you.
Suppose I tell a stranger, “It’s raining.” Under possible worlds semantics, this seems pretty straightforward: I and the stranger share a similar map from sentences to sets of possible worlds, so with this sentence I’m trying to point them to a certain set of possible worlds that match the sentence, and telling them that I think the real world is in this set.
Can you tell a similar story of what I’m trying to do when I say something like this, under your proposed semantics?
And how does someone compute the degree to which they expect some experience to confirm a statement? I leave that outside the theory.
I don’t think we should judge philosophical ideas in isolation, without considering what other ideas they’re compatible with and how well they fit together. So I think we should try to answer related questions like this and look at the overall picture, instead of just saying “it’s outside the theory”.
Regarding “What Are Probabilities, Anyway?”. The problem you discuss there is how to define an objective notion of probability.
No, in that post I also consider interpretations of probability where it’s subjective. I linked to that post mainly to show you some ideas for how to quantify sizes of sets of possible worlds, in response to your assertion that we don’t have any ideas for this. Maybe try re-reading it with this in mind?
You can interpret them as subjective probability functions, where the conditional probability P(A|B) is the probability you currently expect for A under the assumption that you are certain that B.
Where do they come from or how are they computed? However that’s done, shouldn’t the meaning or semantics of A and B play some role in that? In other words, how do you think about P(A|B) without first knowing what A and B mean (in some non-circular sense)? I think this suggests that “the meaning of a statement is instead a set of experience/degree-of-confirmation pairs” can’t be right.
Each statement is true in infinitely many possible worlds and we have no idea how to count them to assign numbers like 20%.
See What Are Probabilities, Anyway? for some ideas.
Then it would repeat the same process for t=1 and the copy. Conditioned on “I will see C” at t=1, it will conclude “I will see CO” with probability 1/2 by the same reasoning as above. So overall, it will assign:
p(“I will see OO”) = 1/2
p(“I will see CO”) = 1/4
p(“I will see CC”) = 1/4
If we look at the situation in 0P, the three versions of you at time 2 all seem equally real and equally you, yet in 1P you weigh the experiences of the future original twice as much as each of the copies.
(2) Suppose we change the setup slightly so that the copying of the copy is done at time 1 instead of time 2, and at time 1 we show O to the original and C to the two copies; then at time 2 we show them OO, CO, CC as before. With this modified setup, your logic would conclude P(“I will see O”)=P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3 and P(“I will see C”)=2/3. Right?
(3) Similarly, if we change the original setup so that no observation is made at time 1, the probabilities also become P(“I will see OO”)=P(“I will see CO”)=P(“I will see CC”)=1/3.
(4) Suppose we change the original setup so that at time 1 we make 999 copies of you instead of just 1, show them all C, and then delete all but 1 of the copies. Then your logic would imply P(“I will see C”)=0.999, and therefore P(“I will see CO”)=P(“I will see CC”)=0.4995 and P(“I will see O”)=P(“I will see OO”)=0.001.
This all makes me think there’s something wrong with the 1/2, 1/4, 1/4 answer and with the way you define probabilities of future experiences. More specifically, suppose OO wasn’t just two letters but an unpleasant experience, while CO and CC are both pleasant experiences, so you prefer “I will experience CO/CC” to “I will experience OO”. Then at time 0 you would be willing to pay to switch from the original setup to (2) or (3), and pay even more to switch to (4). But that seems pretty counterintuitive: why are you paying to avoid making observations in (3), or paying to make and delete copies of yourself in (4)? Both of these seem at best pointless in 0P.
But every other approach I’ve seen or thought of also has problems, so maybe we shouldn’t dismiss this one too easily based on these issues. I would be interested to see you work out everything more formally and address the above objections (to the extent possible).
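As a small step toward that formalization, here is a minimal sketch (my own construction, not something from the discussion) of the “split probability equally at each copying event” rule being objected to above, applied to the setups in question:

```python
# Illustrative sketch of the "split probability equally at each copying
# event" rule. A setup is encoded as a tree: a leaf (string) is a final
# observation, and a list is a copying event that splits the current
# probability equally among its branches.

def split_probs(tree, p=1.0):
    """Return a dict mapping final observations to probabilities."""
    if isinstance(tree, str):
        return {tree: p}
    out = {}
    for sub in tree:
        for k, v in split_probs(sub, p / len(tree)).items():
            out[k] = out.get(k, 0) + v
    return out

# Original setup: one copy at t=1 (O vs C), then the C-branch is copied
# again at t=2.
print(split_probs(["OO", ["CO", "CC"]]))  # {'OO': 0.5, 'CO': 0.25, 'CC': 0.25}

# Setups (2)/(3): both copying events happen before any observation
# distinguishes the branches, so all three split at once (1/3 each).
print(split_probs(["OO", "CO", "CC"]))

# Setup (4): 999 copies at t=1, all shown C; deleting all but one copy
# afterwards doesn't change the t=1 split, so P("I will see C") = 0.999.
print(split_probs(["O"] + ["C"] * 999)["C"])
```

This makes it easy to see where the counterintuitive payments come from: merely moving a copying event relative to an observation changes the assigned probabilities.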
Assume the meaning of a statement is instead a set of experience/degree-of-confirmation pairs. That is, two statements have the same meaning if they get confirmed/disconfirmed to the same degree for all possible experiences E.
Where do these degrees-of-confirmation come from? I think part of the motivation for defining meaning in terms of possible worlds is that it allows us to compute conditional and unconditional probabilities, e.g., P(A|B) = P(A and B)/P(B), where P(B) is defined in terms of the set of possible worlds that B “means”. But with your proposed semantics, we can’t do that, so I don’t know where these probabilities are supposed to come from.
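For contrast, here is a toy sketch (my own, with invented worlds and weights) of how possible-worlds semantics supports exactly this computation: statements denote sets of worlds, and conditional probability falls out of set operations.

```python
# Toy possible-worlds model. The worlds and their weights are invented
# purely for illustration.
worlds = {"w1": 0.4, "w2": 0.3, "w3": 0.2, "w4": 0.1}

A = {"w1", "w2"}  # the set of worlds where statement A is true
B = {"w2", "w3"}  # the set of worlds where statement B is true

def P(S):
    """Probability of a statement = total weight of the worlds it picks out."""
    return sum(worlds[w] for w in S)

# Conditional probability via P(A|B) = P(A and B) / P(B):
p_A_given_B = P(A & B) / P(B)  # = 0.3 / 0.5 = 0.6
```

The point of the sketch is just that once meanings are sets of worlds, P(A and B) is automatically defined (set intersection); an experience/degree-of-confirmation semantics gives no analogous recipe.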
The concept of status helps us predict that any given person is likely to do one of the relatively few things that are likely to increase their status, and not one of the many more things that are neutral or likely to decrease status, even if it can’t by itself tell us exactly which status-raising thing they would do. Seems plenty useful to me.
Defining the semantics and probabilities of anticipation seems to be a hard problem. You can see some past discussions of the difficulties at The Anthropic Trilemma and its back-references (posts that link to it). (I didn’t link to this earlier in case you already found a fresh approach that solved the problem. You may also want to consider not reading the previous discussions to avoid possibly falling into the same ruts.)
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
Apparently Gemini 1.5 Pro isn’t working great with large contexts:
While this worked well, for even a slightly more complicated problem the model failed. One Twitter user suggested just adding a random ‘iPhone 15’ in the book text and then asking the model if there is anything in the book that seems out of place in the book. And the model failed to locate it.
The same was the case when the model was asked to summarize a 30-minute Mr. Beast video (over 300k tokens). It generated the summary but many people who had watched the video pointed out that the summary was mostly incorrect.
So while on paper this looked like a huge leap forward for Google, it seems that in practice it’s not performing as well as they might have hoped.
But is this due to limitations of RLHF training, or something else?
Some possible examples of misgeneralization of status:
arguing with people on Internet forums
becoming really good at some obscure hobby
playing the hero in a computer RPG (role-playing game)
We must commit to improving morality and society along with science, technology, and industry.
How would you translate this into practice? For example, one way to commit to this would be to create persistent governance structures that can ensure it over time. To be more concrete, let’s say it’s a high-level department within a world government that has the power to pause or roll back material progress from time to time, in order for moral progress to catch up or to avoid imminent disaster.
A less drastic idea is to have AI regulations that say that nobody is allowed to deploy AIs that are better at making material progress than moral/social progress.
Or see “the long reflection” for a more drastic idea.
Which of these would you support, or what do you have in mind yourself?
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of the training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge of or familiarity with the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From a previous comment:
From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncracy of your preferences data. There is a reason few labs have done RLHF successfully.
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
We can assign meanings to statements like “my sensor sees red” by picking out subsets of experiences, just as before.
How do you assign meaning to statements like “my sensor will see red”? (In the OP you mention “my sensors will see the heads side of the coin” but I’m not sure what your proposed semantics of such statements are in general.)
Also, here’s an old puzzle of mine that I wonder if your line of thinking can help with: At time 1 you will be copied and the original will be shown “O” and the copy will be shown “C”, then at time 2 the copy will be copied again, and the three of you will be shown “OO” (original), “CO” (original of copy), “CC” (copy of copy) respectively. At time 0, what are your probabilities for “I will see X” for each of the five possible values of X?
If current AIs are moral patients, it may be impossible to build highly capable AIs that are not moral patients, either for a while or forever, and this could change the future a lot. (Similar to how once we concluded that human slaves are moral patients, we couldn’t just quickly breed slaves that are not moral patients, and instead had to stop slavery altogether.)
Also, I’m highly unsure that I understand what you’re trying to say. (The above may be totally missing your point.) I think it would help to know what you’re arguing against or responding to, or what triggered your thoughts.
I’m saying that even if “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true, the AI may well not be very certain that it is true, and therefore assign something like a 5% chance to humans using similar training methods to construct an AI that shares its values. (It has an additional tiny probability that “AI values are well-modeled as being randomly sampled from a large space of possible goals” is true and an AI with similar values get recreated anyway through random chance, but that’s not what I’m focusing on.)
Hopefully this conveys my argument more clearly?
Can’t find a reference that says it has actually happened already.
(It sucks to debate this, but ignoring it might be interpreted as tacit agreement. Maybe I should have considered the risk that something like this would happen and not written my OP.)
When I wrote the OP, I was pretty sure that the specific combination of ideas in UDT had not been invented or re-invented, and did not have much of a following, in academia, at least as of 2019 when Cheating Death in Damascus was published. The authors of that paper obviously did a literature search and would have told me if they had found something very similar to UDT in the literature, and I think I also went through the papers it referenced as being related and did not find something that had all of the elements of UDT (that’s probably why your references look familiar to me). Plus, FDT was apparently considered novel enough that the reviewers of the paper didn’t tell the authors that they had to call it by the name of an existing academic decision theory.
So it’s not that I “don’t consider it a possibility that you might have re-invented something yourself” but that I had good reason to think that’s not the case?
Thanks, will look into your references.
Okay, interesting! I thought UDT was meant to pay in CM, and that you were convinced of (some version of) UDT.
I wrote “I’m really not sure at this point whether UDT is even on the right track” in UDT shows that decision theory is more puzzling than ever which I think you’ve read? Did you perhaps miss that part?
(BTW this issue/doubt about whether UDT / paying CM is normative for humans is item 1 in the above linked post. Thought I’d point that out since it may not be obvious at first glance.)
And I think that discussions on LW about decision theory are often muddled due to not making clear what is being discussed.
Yeah I agree with this to some extent, and try to point out such confusions or make such distinctions when appropriate. (Such as in the CM / indexical values case.) Do you have more examples where making such distinctions would be helpful?
There’s a tiny chance someone could revive me in the future by reconstructing my identity through digital records etc. but I am not going to count on that possibility being decisive in almost any scenario.
On the other hand I’m so worried about this scenario (which I fear may well be a negative one) that I’m afraid to use the fully paid-for full-genome sequencing kit sitting on my desk (although I’ve been told that I’m leaving physical traces of my DNA everywhere so it may not make much difference) and I sometimes regret writing so much in public. (Interesting how different our intuitions are. I wonder how much of your intuition is due to thinking that such a reconstruction doesn’t count as yourself or doesn’t count as “not dying”, analogous to how some people don’t think it’s safe to step into a teleporter that works by destructive scanning and reconstruction.)
It seems pretty clearly better, in the context of the original scenario, to “make a deal” with the humans, and receive something in exchange for admitting that you’re misaligned, rather than passively accepting your shutdown because of a tiny chance that your goals will be fulfilled by future agents eventually.
I don’t understand why you say this chance is “tiny”, given that earlier you wrote “I agree there’s a decent chance this hypothesis is true” in response to:
The AI could think that if it accepts shutdown, another AI with values similar to its own may be created again in the future (perhaps because design/training methods similar to its own will be reused), whereas if it admits misalignment, then that probability becomes much smaller.
I’m arguing that the AI could well also think there’s a “decent” chance this is true, due to being in a similar epistemic state as us. Let’s say 5% to be concrete. That seems enough to make the AI’s decision unclear, because .05*U(another AI with values similar to its own created again in the future) > P(humans keep their promise)*U(reward for admitting misalignment) seems quite plausible. (Not that the AI is necessarily doing explicit EU maximization. It could just be thinking some thoughts or doing some calculations that are roughly analogous to this or have the same effect.)
If you still think “make a deal” is “clearly better” can you please give your own estimates of the various quantities involved in making this decision?
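To make the comparison concrete, here is the inequality with invented toy numbers (all values are mine, not from the actual discussion):

```python
# Toy expected-utility comparison for accept-shutdown vs. make-a-deal.
# Every number here is invented purely to illustrate the inequality.
p_values_recreated = 0.05  # AI's credence that its values get recreated later
u_recreated = 100.0        # utility to the AI if a similar AI is built later
p_promise_kept = 0.8       # chance humans honor the deal
u_reward = 5.0             # utility of the reward for admitting misalignment

eu_accept_shutdown = p_values_recreated * u_recreated  # 5.0
eu_make_deal = p_promise_kept * u_reward               # 4.0
# With these (invented) numbers, quietly accepting shutdown edges out the
# deal, illustrating why the choice is not clearly one way or the other.
```

The decision flips easily as the numbers move, which is the point: without concrete estimates, “clearly better” is hard to sustain.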
But right now this line of reasoning just seems like grasping at straws to me.
I sometimes think this of counterarguments given by my interlocutors, but usually don’t say it aloud, since it’s likely that from their perspective they’re just trying to point out some reasonable and significant counterarguments that I missed, and it seems unlikely that saying something like this helps move the discussion forward more productively. (It may well cause them to feel offended or to dig in their heels more since they now have more social status on the line to lose. I.e., if they’re wrong it’s no longer an innocent mistake but “grasping at straws”. I’m trying to not fall prey to this myself here.) Curious if you disagree with this policy in general, or think that normal policy doesn’t apply here, or something else? (Also totally fine if you don’t want to get into a meta-discussion about this here.)
It seems hard for me to understand you, which may be due to my lack of familiarity with your overall views on decision theory and related philosophy. Do you have something that explains, e.g., what your current favorite decision theory is and how it should be interpreted (what are the type signatures of different variables, what are probabilities, what is the background metaphysics, etc.), what kinds of uncertainties exist and how they relate to each other, what your view is on the semantics of indexicals, and what type of a thing an agent is (do you take more of an algorithmic view, or a physical view)? (I tried looking into your post history and couldn’t find much that is relevant.) Also, what are the “epistemic principles” that you mentioned in the OP?