Mathematical Logic grad student, doing AI Safety research for ethical reasons.
Working on conceptual alignment, decision theory, cooperative AI and cause prioritization.
My webpage.
Leave me anonymous feedback.
While the variety of answers is indeed surprising, I think many of them could be read as different accounts of a single central intuition, so that we’d end up with ~7 deep interpretations instead of 17 different answers.
For example, I think all of the following can be understood as different accounts of "there being something that it feels like to be me", or "experiencing the Cartesian theatre":
experience of distinctive affective states, proprioception, awakeness, and maybe mind-location
Guy who reinvents predictive processing through Minecraft
Thank you for engaging, Eliezer.
I completely agree with your point: an agent being updateless doesn't mean it won't learn new information. In fact, it might well decide to "make my future action A depend on future information X", if the updateless prior finds that optimal. In other situations, when the updateless prior deems it net-negative (maybe due to other agents exploiting this future dependence), it won't.
This point is already observed in the post (see e.g. footnote 4), although without going deep into it, since the post is meant for the layman (it is addressed more deeply, for example, in section 4.4 of my report). Also for illustrative purposes, in two places I have (maybe unfairly) caricatured an updateless agent as being "scared" of learning more information. What this really means (as hopefully clear from earlier parts of the post) is "the updateless prior assessed whether it seemed net-positive to let future actions depend on future information, and decided no (for almost all actions)".
The problem I present is not "being scared of information", but the trade-off between "letting your future action depend on future information X" and "not doing so" (and, in more detail, how exactly it should depend on that information). More dependence allows you to correctly best-respond in some situations, but could also sometimes get you exploited. The problem is that there is no universal (belief-independent) rule for deciding when to allow dependence: different updateless priors will decide differently. And they need to decide in advance of letting their deliberation depend on their interactions (they still don't know whether that is net-positive).
Due to this prior-dependence, if different updateless agents have different beliefs, they might play very different policies, and miscoordinate. This is also analogous to different agents demanding different notions of fairness (more here). I have read no convincing arguments as to why most superintelligences will converge on beliefs (or notions of fairness) that successfully coordinate on Pareto optimality (especially in the face of the problem of trapped priors i.e. commitment races), and would be grateful if you could point me in their direction.
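To make the prior-dependence concrete, here's a toy numerical sketch (my own illustration, with invented payoffs): an agent chooses between a policy that lets its action depend on a signal X and one that ignores X, and which policy looks best depends entirely on the prior probability that another agent will exploit that responsiveness.

```python
# Toy sketch (invented numbers): the trade-off between letting your action
# depend on information X (responsive) vs committing to ignore X.
# Responsiveness lets you best-respond, but can be exploited (e.g. it makes
# threats against you profitable to set up).

def expected_value(responsive: bool, p_exploiter: float) -> float:
    """Ex-ante EV of a policy, given the prior probability that the other
    agent exploits responsiveness."""
    if responsive:
        ev_vs_exploiter = -5.0   # exploited: you cave to threats
        ev_vs_benign = 10.0      # you best-respond to the signal X
    else:
        ev_vs_exploiter = 3.0    # nothing to exploit, but no fine-tuned response
        ev_vs_benign = 3.0
    return p_exploiter * ev_vs_exploiter + (1 - p_exploiter) * ev_vs_benign

for p in (0.1, 0.5, 0.9):  # three different updateless priors
    responsive_is_best = expected_value(True, p) > expected_value(False, p)
    print(p, "depend on X" if responsive_is_best else "ignore X")
# Only the 0.1 prior recommends depending on X. Agents with different priors
# pick different policies, which is exactly the miscoordination worry above.
```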
I interpret you as expressing a strong normative intuition in favor of ex ante optimization. I share this primitive intuition, and indeed it remains true that, if you have some prior and simply want to maximize its EV, updatelessness is exactly what you need. But I think we have discovered other pro tanto reasons against updatelessness, like updateless agents probably performing worse on average (in complex environments) due to trapped priors and increased miscoordination.
Hi Elizabeth! Thanks for reaching out. Excuse my delay in responding, and the length of this reply. It felt important to communicate the nuances in my views (and the anecdotal experiences), which in our past exchanges might not have come through.
you’d rather nutritional difficulties with veganism weren’t discussed, even when the discussion is truthful and focuses on mitigations within veganism
That’s not my position. To the extent the naive transition accounts are representative of what’s going on in rat/EA spheres, some intervention that reduces the number of transitions that are naive (while holding the number of transitions fixed) would be a Pareto improvement. And an intervention that reduces the number of transitions that are naive, and decreases the number of transitions far less, would also be net-positive.
My worry, though, is that singling out veganism for this is not the most efficient way to achieve this. I hypothesize that:
Naive transitions are more correlated with social dynamics of insufficient self-care that are not exclusive (nor close to exclusive) to veganism in rat/EA spheres.
Independently of that, a message focused on veganism will turn out net-negative, because of the following aggregated collateral effects:
Decreasing the number of overall transitions too much. Or, better put, incentivizing some thought-patterns and dynamics upstream of that decrease, which have even worse consequences than the decrease itself.
Relatedly, incentivizing a community that’s more prone to ignoring important parts of the holistic picture when that goes to the selfish benefit of individuals. (And that’s certainly something we don’t want to happen around the people taking important ethical decisions for the future.)
More on 2. below (On framing), but let me get into 1. first.
I was very surprised to hear those anecdotal stories of naive transitions, because in my anecdotal experience across many different vegan and animalist spaces, serious talk about nutrition, and constant reminders to put health first, have been ever-present norms. And, at the same time, there is a recognition that turning vegan, even with all these nutrition subtleties, is not nearly as difficult as people imagine (certainly in part due to selection effects).[1]
I hypothesize that the distributional shift is due to properties of the social dynamics and individual mindspace that rat/EA circles inadvertently encourage, especially on wide-eyed newcomers. The same optimizing mindset leading to “burn-out / overwork / too much Huel / exotic unregulated diets / not taking care of your image / dangerous drug practices linked to work” around these spaces seems to me to be one of the central causes of these naive transitions. I think this is also psychologically linked to the rational justification I’ve heard from some x-riskers: their work is just too important to care about anything else. Obviously that backfires.
Now, even given the above, it is coherent to believe that, despite this common root, veganism in particular is such a prominent example, with so many negative consequences, that a straightforward intervention pushing the message that veganism presents tradeoffs and can be difficult, or may not be for everyone, is net-positive. After all, this is largely a quantitative question. I claim that’s not the case, and it’s related to collective blind spots we shouldn’t ignore, which brings me back to 2.
On framing:
One thing that might be happening here is that we’re speaking at different simulacra levels. I’m not claiming you’re saying anything untrue, just that the consequences of pushing this line in this way are probably net-negative.
Now, I understand the benefits of general adoption of the policy “state transparently the true facts you know, and that other people seem not to know”. Unfortunately, my impression is that this community is not yet in a position in which implementing this policy will be viable or generally beneficial for many topics. And indeed, on some priors it makes sense that a community suddenly receiving an influx of external attention will have to work up to that slowly, if it is possible at all.[2]
I believe one of those topics is veganism, because of the strong intuitive aversion individuals across the board feel towards changing their diet for ethical reasons (and, I claim, this aversion is irrational and should be counteracted and scrutinized accordingly). In my anecdotal experience (from many years discussing veganism with vegans and non-vegans), almost all of the justifications for not transitioning to a vegan diet fall easily to “Is that your true rejection?” (even if the best route is not always to mention that explicitly, of course).
I expected to see that change in EA circles. Unfortunately, that has not been my anecdotal experience. The animalist part of EA, so to speak, is very strongly concentrated in a few individuals (or better said, sub-communities), who have taken this issue seriously enough, and are often even working directly on animal welfare. But when you move slightly away from that space into neighboring sub-communities (say, a randomly sampled alignment researcher), defensive motivated reasoning on the topic seems to go up again.[3]
And there are instances in which I’ve obtained direct evidence that individuals in decisive positions are not reasoning correctly about the tradeoffs involved. As an example, consider that some offices / programs / retreats don’t offer (as the free lunch for their members) a completely vegetarian menu (let alone vegan).
From the outside, I don’t understand how this decision can make sense for any rat/EA space. Even if it were true that veganism requires some extra effort, that would be because of the complications related to health tracking or food planning. Those are not present in this case. The food is served, for free. Organizers can put in the extra effort to make sure that the offering is nutritionally complete each day, or across the week (as also happens with omnivore menus). Whoever is not vegan need not worry much about any of that, since they’ll be eating omnivore outside the office. And whoever is vegan will have to worry about it anyway, precisely because they’re vegan. Having meat on this menu doesn’t directly improve anyone’s health.
And the few times I’ve seen this “from the inside”, that is, heard organizers’ reasoning about this decision, they really didn’t seem to have meaningful arguments, and appealed to a general notion of individual freedom which I think is not a good ethical proxy, and which, translated to other domains, would lead to “not taking side-effects seriously”, or “not taking the dangers of social dynamics seriously”.
I have, of course, heard the obvious argument (although not from organizers) that x-risk research is so important that, if having a vegan menu might slightly turn off a single valuable researcher, it’s not worth it. This of course resonates a lot with the optimizy mindspace referenced above. Here I’ll just say that I don’t think this is the kind of desperate community we want to build. That this can just as easily turn off ethically conscious people, whom we do want in our community. And that this mindset is very correlated with the “unconstrained obsession with talent” that has led to the community being partly captured by the ML community, to weird epistemic areas incentivizing bad elitism and power dynamics, etc. In simpler words, I think this blows past some healthy and necessary deontological fences (more in the next section).
I also cannot help but feel suspicious that these practices are so comfortably presented as the default, and alarmed (in some precautionary sense) that grant money is being used to finance something as horrible, and vast, and openly debated, as animal exploitation (more in the next section).
Let me also note that I don’t agree that your posts (and the ensuing comments and conversations) were focused on mitigations within veganism, due to their framing. Even if you truthfully discussed these mitigations, the general tone of skepticism about the viability and importance of veganism was very clear, and it is obvious which message most people will take from this post. I’d love to live in a world where I could trust your readers to Bayesian-update and ignore framings, but it would be self-delusional to think this will happen here, given the obvious strong pulls everyone has towards motivated ignorance of veganism (and the evidence I’ve obtained that this also exists inside this community), and how the framing / headlines / first-order updates from your posts resonate with those repeated one-dimensional rationalizations.
A background ethical disagreement:
One thing that might also be happening is just that we disagree ethically. After all, if I didn’t care at all about veganism (or related individual ethical practices), I wouldn’t care about how many vegans are lost, as long as the number of naive vegans decreases (ignoring second-order effects on community epistemics, as discussed above). And indeed, it is popular amongst some rationalists to doubt the possibility that animals can suffer, something I strongly ethically disagree with.[4]
But I’m not sure that’s the main driver of our disagreement. If we disagree about how hard to push veganism, or how deeply to consider the negative consequences of having a less vegan community, it might be because of a disagreement about where the utilitarianism / deontology line should be in this topic. (After all, you could very strongly worry about animal suffering, and nonetheless bet absolutely all your efforts on x-risk research, because of being a naive Expected Value Maximizer.) Or equivalently, about how bad the consequences for community dynamics can be, and whether it’s better to resort to rule utilitarianism on this one.
As an extreme example, I very strongly feel like financing the worst moral disaster of current times so that “a few more x-risk researchers are not slightly put off from working in our office” is way past the deontological failsafes. As a less extreme example, I strongly feel like sending a message that will predictably be integrated by most people as “I can put even less mental weight on this one ethical issue that sometimes slightly annoyed me” also is. And in both cases, especially because of what they signal, and the kind of community they incentivize.
The right way to discuss these challenges:
As must be clear, I’d be very happy with treating the root causes, related to the internalized optimizy and obsessive mindset, instead of the single symptom of naive vegan transitions. This is an enormously complex issue, but I a priori think available health and wellbeing resources, and their continuous establishment as a resource that should be used by most people (as an easy route to having that part of life under control and not spiraling, similar to how “food on weekdays” is solved for us by our employers), would provide the individualization and nuance that these problems require. Something like “hey, from now on, those of you who follow any slightly unconventional diet, or have this other thing, or suffer that other thing, can go talk to these people, and they will help you do blah” would sound pretty good, and would possibly be a welfare multiplier for some people. It certainly sounds better to me than just broadcasting the message “veganism is hard, consider the tradeoffs and either search for help or drop veganism”. Additionally, “hard” is very variable, and that nuance, and your observations, will get lost.
Something like running small group analytics on some naive vegans as an excuse for them to start thinking more seriously about their health? Yes, nice! That’s individualized, that’s immediately useful. But additionally extracting some low-confidence conclusions and using them to broadcast the above message (or a message that will get first-order approximated to that by 75%) seems negative.
It feels weird for me to think about solutions to this community problem, since in my other spheres it hasn’t arisen. But thinking about which things that happened in those spheres could have contributed positively, the first things that come to mind are: talks / events / activities about sports and health (or even explicitly nutrition), memes about nutrition (post-ironic B12 slander, etc.), communal environments where this knowledge is likely to be shared (like literal cooking).
I also observe that the more individualized approach might work better for a more close-knit community, and that might be especially unattainable now. Maybe there’s some other way to bootstrap this habit. Relatedly, I’d feel safer about some more oversight with regards to some health practices in general (especially drugs, and especially newcomers). But I observe that anything looking like policing is complicated.
In any event, I’m no expert in community health, and my separate point stands that I think broadcasting that message is net-negative right now, because of the obvious bottom-line people would extract from it.
Thanks for reading all of that. The next few weeks are busy, so I might again take a while to reply. Nonetheless, I saw in your last post that you’re thinking about vegan epistemics. In case you’d find it valuable, I’d be willing to discuss those thoughts as well, or give opinions on concrete topics, or just talk about my experience. But of course, no problem if you don’t.
And, to be fair, I have even observed this health consciousness in the few EAs I know who are very vocal about their veganism and animalism (they are few because inside EA I’ve been closer to AI safety than animal welfare).
I could also note that this situation is slightly different from a straightforward “these people believe this false fact”. More accurately, the truth value of this fact hasn’t been brought to their attention, because of a complex web of learned emotional and mental habits. I’m talking here about focusing on those habits, as opposed to the practice of veganism in particular.
It’s obviously hard to draw the boundary of motivated reasoning. I have observed pretty clear-cut cases, but let me just leave it at “these intelligent people are not thinking about / taking seriously / maintaining up to their epistemic standards this aspect of their life as much as they should according to some of their own stated preferences or revealed preferences (and as clearly they are able to)”.
Due to my views on consciousness and moral antirealism, I think deciding which physical systems count as suffering is an ethical choice (equivalent to saying “I care about this process not happening”), and not a purely descriptive one.
Hi Wei! Abram and I have been working on formalizing Logical Updatelessness for a few months. We’ve been mostly setting a framework and foundations using Logical Inductors, and building the obvious UDT algorithms. But we’ve also stumbled upon some of the above problems (especially pitfalls of EVM / commitment races, and logical conditionals vs counterfactuals / natural accounts of logically uncertain reasoning), and soon we’ll turn more thoroughly to the Game Theory enabled by this Learning Theory.
You’re welcome to join the PIBBSS Symposium on Friday 22nd 18:30 CEST, where I’ll be presenting some of our ideas (more info). We still have a lot of open avenues, so no in-depth write-up yet, but soon a First Report will exist.
Also, of course, feel free to hit me with a DM anytime.
(I will not try to prove transitivity here, since my goal is to get the overall picture across; I have not checked it, although I expect it to hold.)
Transitivity doesn’t hold; here’s a counterexample.
The intuitive story is: X’s action tells you whether Z failed, Y fails sometimes, and Z fails more rarely.
The full counterexample (all of the following is according to your beliefs): Say the available actions are 0 and 1. There is a hidden fair coin, and your utility is high if you manage to match the coin, and low if you don’t. Y peeks at the coin and takes the correct action, except when it fails, which happens with 1⁄4 chance. Z does the same, but only fails with a 1⁄100 chance. X plays 1 iff Z has failed.
Given X’s and Y’s actions, you always go with Y’s action, since X tells you nothing about the coin, and Y gives you some information. Given Z’s and Y’s actions, you always go with Z’s, because it’s less likely to have failed (even when they disagree). But given Z’s and X’s, there will be some times (a 1⁄100 chance) in which you see X played 1, and then you will not play the same as Z.
The same counterexample works for beliefs (or continuous actions) instead of discrete actions (where you will choose a probability to believe, instead of an action), but needs a couple of small changes. Now both Z and Y fail with 1⁄4 probability (independently). Also, Y outputs its guess as 0.75 or 0.25 (instead of 1 or 0), because YOU will be taking into account the possibility that it has failed (and Y had better output whatever you will want to guess after seeing it). Instead of Z, consider A as the third expert, which outputs 0.5 if Z and Y disagree, 15⁄16 if they agree on yes, and 1⁄16 if they agree on no. X still tells you whether Z failed. Seeing Y and X, you always go with Y’s guess. Seeing A and Y, you always go with A’s guess. But if you see A = 15⁄16 and X = 1, you know both failed, and guess 0. (In fact, even when you see X = 0, you will guess 1 instead of 15⁄16.)
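For whoever wants to double-check the discrete version, here’s a small exact-enumeration sketch (no sampling; function and variable names are my own):

```python
from itertools import product
from collections import defaultdict

# Exact enumeration of the joint distribution over (coin, Y fails?, Z fails?).
# Y fails with probability 1/4, Z with 1/100; X plays 1 iff Z failed.
def joint():
    for coin, y_fail, z_fail in product([0, 1], [False, True], [False, True]):
        p = 0.5 * (0.25 if y_fail else 0.75) * (0.01 if z_fail else 0.99)
        guesses = {
            "Y": (1 - coin) if y_fail else coin,
            "Z": (1 - coin) if z_fail else coin,
            "X": 1 if z_fail else 0,
        }
        yield p, coin, guesses

def always_follows(seen, expert):
    """True iff the Bayes-optimal action after seeing the experts in `seen`
    always coincides with `expert`'s action."""
    posterior = defaultdict(lambda: [0.0, 0.0])  # observation -> mass on coin = 0 / 1
    for p, coin, g in joint():
        posterior[tuple(g[e] for e in seen)][coin] += p
    for p, coin, g in joint():
        mass = posterior[tuple(g[e] for e in seen)]
        best = 1 if mass[1] > mass[0] else 0
        if p > 0 and best != g[expert]:
            return False
    return True

print(always_follows(("X", "Y"), "Y"))  # True: X says nothing about the coin
print(always_follows(("Y", "Z"), "Z"))  # True: Z is less likely to have failed
print(always_follows(("X", "Z"), "Z"))  # False: when X = 1 you act against Z
```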
Tangent:
AGI is the sweetest, most interesting, most exciting challenge in the world.
We usually concede this point and I don’t even think it’s true. Of course, even if I’m right, maybe we don’t want to push in this direction in dialogue, because it would set the bad precedent of not defending ethics over coolness (and sooner or later something cool will be unethical). But let me discuss anyway.
Of course building AGI is very exciting, and it incentivizes some good problem-solving. But doing it through Deep Learning, the way OpenAI does, has an indelible undercurrent of “we don’t exactly know what’s going on, we’re just stirring the linear algebra pile”. Of course that can already be an incredibly interesting engineering problem, and it’s not like you don’t need a lot of knowledge and good intuitions to make these hacky things work. But I’m sure the aesthetic predispositions of many (and especially the more mathematically oriented) will line up much better with “actually understanding the thing”. From this perspective, Alignment, Deep Learning Theory, Decision Theory, understanding value formation, etc. feel like fundamentally more interesting intellectual challenges. I share this feeling, and I think many other people do. A lot of people have been spoiled by the niceness of math, and/or can’t stand the scientific shallowness of ML developments.
I think my position has been strongly misrepresented here.
As per the conclusion of this other comment thread, I here present a completely explicit explanation of where and how I believe my position to have been strongly misrepresented. (Slapstick also had a shot at that in this shorter comment.)
Misrepresentation 1: Mistaking arguments
Elizabeth summarizes
The charitable explanation here is that my post focuses on naive veganism, and Soto thinks that’s a made-up problem. He believes this because all of the vegans he knows (through vegan advocacy networks) are well-educated on nutrition.
It is false that I think naive veganism is a made-up problem, and I think Elizabeth is drawing the wrong conclusions from the wrong comments.
Her second sentence is clearly a reference to this short comment of mine (which was written as a first reaction to her posts, before my longer and more nuanced explanation of my actual position):
I don’t doubt your anecdotal experience is as you’re telling it, but mine has been completely different, so much so that it sounds crazy to me to spend a whole year being vegan, and participating in animal advocacy, without hearing mention of B12 supplementation. Literally all vegans I’ve met have very prominently stressed the importance of dietary health and B12 supplementation.
As should be obvious, this is not contesting the existence of naive veganism (“I don’t doubt your anecdotal experience”), but just contrasting it with my own personal anecdotal experience. This was part of my first reaction, and didn’t yet involve a presentation of my actual holistic position.
Elizabeth arrives at the conclusion that, because of my anecdotal experience, I believe naive veganism doesn’t exist (that is, that I don’t trust the other anecdotal experiences reported by her or other commenters), and that this is the reason I don’t agree with her framing. I think my longer explanation makes it evident that I’m not ignoring the existence of naive veganism, but instead weighing it quantitatively against other consequences of Elizabeth’s posts and framings. For example:
To the extent the naive transition accounts are representative of what’s going on in rat/EA spheres, some intervention that reduces the number of transitions that are naive (while holding the number of transitions fixed) would be a Pareto improvement. And an intervention that reduces the number of transitions that are naive, and decreases the number of transitions far less, would also be net-positive.
and
My worry, though, is that singling out veganism for this is not the most efficient way to achieve this. I hypothesize that:
Naive transitions are more correlated with social dynamics of insufficient self-care that are not exclusive (nor close to exclusive) to veganism in rat/EA spheres.
Of course, these two excerpts are already present in the screenshots included in the post (which indeed contain some central parts of my position, although they leave out some important nuance), so I find this misrepresentation especially hard to understand or explain. I think it’s obvious, when saying something like “Naive transitions are more correlated with social dynamics...”, that I acknowledge their existence (or at least their possible existence).
Yet another example, in the longer text I say:
It feels weird for me to think about solutions to this community problem, since in my other spheres it hasn’t arisen.
This explicitly acknowledges that this is a problem that exists in this community. Indeed, engaging with Elizabeth’s writing and other anecdotal accounts in the comments has updated my estimate of how many naive vegans exist in the rationalist community upwards. But this is mostly independent of my position, as the next paragraph addresses.
You might worry, still, that my position (even if this is not stated explicitly) is in reality motivated by such a belief, and that the other arguments are rationalizations. First off, I will note that this is a more complex inference, and while it is possible to believe it, it’s clearly not a faithful representation of the explicit text, and should be flagged as such. But nonetheless, in the hope of convincing you that this is not going on, I will point out that my actual position and arguments have to do mostly with social dynamics, epistemics, and especially the community’s relationship to ethics. (See Misrepresentation 2: Mistaking claims, for proof of this.)
Given, then, that the most important consequences of all this go through dynamics, epistemics and relationship to ethics (since this community has some chance of steering big parts of the future), I think it’s clear that my position isn’t that sensitive to the exact number of naive vegans. My position is not about ignoring those that exist: it is about how to go about solving the problem. It is about praxis, framing and optimal social dynamics.
Even more concretely, you might worry I’m pulling off a Motte and Bailey, trying to “quietly imply” with my text that the number of naive vegans is low, even if I don’t say it explicitly. You might get this vibe, for example, from the following phrasing:
To the extent the naive transition accounts are representative of what’s going on in rat/EA spheres,...
I want to make clear that this phrasing is chosen to emphasize that I think we’re still somewhat far from having rigorous scientific knowledge about how prevalent naive veganism is in the community (for example, because your sample sizes are small, as I have mentioned in the past). That’s not to neglect the possibility of naive veganism being importantly prevalent, as acknowledged in excerpts above.
I also want to flag that, in the short comment mentioned above, I said the following:
since most vegans do supplement [citation needed, but it’s been my extensive personal and online experience, and all famous vegan resources I’ve seen stress this]
This is indeed expressing my belief that, generally, vegans do supplement, based on my anecdotal experience and other sources. This is not yet talking explicitly about whether this is the case within the community (I didn’t have strong opinions about that yet), but it should correctly be interpreted (in the context of that short comment) as a vague prior I was using to be (a priori) doubtful of deriving strong conclusions about the prevalence in the community. I do think this vague prior is still somewhat useful, and that we still don’t have conclusive evidence (as mentioned above). But it is also true (as mentioned above) that this was my first reaction, written before my long text holistically representing my position, and since then I have updated my estimate of how many naive vegans exist in the rationalist community upwards. So it makes sense that this first short comment was more tinged by an implicit lower likelihood of that possibility, but this was superseded by further engagement with posts and comments, and that is explicitly acknowledged in my later text, as the above excerpts demonstrate.
Finally, one might say “well, of course Elizabeth didn’t mean that you literally think 0 naive vegans exist; it was just a way of saying you thought too few of them existed, or you were purposefully not putting weight on them”. First off, even if those had been my actual stated or implied positions, I want to note that this use of hyperbole can already tinge a summary with unwarranted implications (especially a short summary of a long text), and thus I would find it an implicit misrepresentation. And this is indeed part of what I think is going on, and that’s why I repeatedly mention that my problem is more with framing and course of action than with informational content itself. But also, as evidenced by the excerpts and reasoning above, it is not the case that I think too few naive vegans exist, or that I purposefully ignore them. I acknowledge the possibility that naive veganism is prevalent amongst community vegans, and also make clear that my worries are not too sensitive to the exact number of naive vegans, and are of a different nature (related to epistemics and dynamics).
In summary, I think Elizabeth reasoned something like “huh, if Martín is expressing these complex worries, it’s probably just because he thinks naive veganism is not a real problem, since he could be understood to have some doubts about that in his early comments”. And I claim that is not what’s going on, and that it completely misses the point of my position, which doesn’t rely in any way on naive veganism not being a thing. On the contrary, it discusses directly what to do in a community where naive veganism is widespread. I hope the above excerpts and reasoning demonstrate that.
Misrepresentation 2: Mistaking claims
Elizabeth summarizes
I have a lot of respect for Soto for doing the math and so clearly stating his position that “the damage to people who implement veganism badly is less important to me than the damage to animals caused by eating them”.
This, of course, makes it very explicitly sound like in my text I only weigh two variables against each other: disvalue caused by naive veganism, and disvalue caused by animal exploitation.
This is missing a third variable that is very present in my long text, and to which many paragraphs are dedicated or make reference: the consequences of all this (posts, framing, actions, etc.) for social dynamics of the community, and the community’s (and its individuals’) relationship to ethics.
In fact, not only is this third variable very present in the text, but in some places I explicitly say it’s the most important variable of the three, thus demonstrating that my arguments mostly have to do with it. Here’s one excerpt making that explicit:
As an extreme example, I very strongly feel like financing the worst moral disaster of current times so that “a few more x-risk researchers are not slightly put off from working in our office” is way past the deontological failsafes. As a less extreme example, I strongly feel like sending a message that will predictably be integrated by most people as “I can put even less mental weight on this one ethical issue that sometimes slightly annoyed me” also is. And in both cases, especially because of what they signal, and the kind of community they incentivize.
Here’s another one, even clearer:
But I am even more worried about the harder-to-pin-down communal effects, “tone setting”, and the steering of very important sub-areas of the EA community into sub-optimal ethical seriousness (according to me), which is too swayed by intellectual fuzzies, instead of actual utilons.
And finally, in my response answering some clarificatory questions from Elizabeth (several days before this post was published), here’s an even more explicit one:
Of course, I too don’t optimize for “number of vegans in the world”, but just a complex mixture including that as a small part. And as hinted above, if I care about that parameter it’s mainly because of the effects I think it has in the community. I think it’s a symptom (and also an especially actionable lever) of more general “not thinking about ethics / Sincerity in the ways that are correct”. As conscious as the members of this community try to be about many things, I think it’s especially easy (through social dynamics) to turn a blind eye on this, and I think that’s been happening too much.
Indeed, one of Elizabeth’s screenshots already emphasizes this, presenting it as one of the central parts of my argument (although it doesn’t yet explicitly mention that it is, for me, the most important consequence):
Relatedly, incentivizing a community that’s more prone to ignoring important parts of the holistic picture when that goes to the selfish benefit of individuals. (And that’s certainly something we don’t want to happen around the people taking important ethical decisions for the future.)
I do admit that, stylistically speaking, this point would have been more efficiently communicated had I explicitly mentioned its importance very near the top of my text (so that it appeared, for example, explicitly mentioned in her first screenshot).
Nonetheless, as the above excerpts show, the point (this third, even more important variable) was made explicit in some fragments of the text (even if the reader could already have understood it as implied by other parts that don’t mention it explicitly). And so, I cannot help but see Elizabeth’s sentence above as a direct and centrally important misrepresentation of what the text explicitly communicated.
You might worry, again, that there’s some Motte and Bailey going on, of the form “explicitly mention those things, but don’t do it at the top of the text, so that it seems like animal ethics is truly the only thing you care about, or something”. I’m not exactly sure what I’d gain from this practice (since it’s patently clear that many readers disagree with me ethically about the importance of animals, so I might as well downweigh its importance). But I will still respond to this worry by pointing out that, even if the importance of this third variable is only explicitly mentioned further down in the text, most of the text (and indeed, even parts of the screenshots) already deals with it directly, thus implying its importance and centrality to my position. Furthermore, most of the text discusses / builds towards the importance of this third variable in a more detailed and nuanced way than just stating it explicitly (to give a better holistic picture of my thoughts and arguments).
In summary, not only does this representation neglect a central part of my text (something I explicitly flagged as the most important variable in my argument), but, because of that, it also attributes to me a statement that I haven’t stated and do not hold. While I am uncertain about it (mostly because of remaining doubts about how prevalent naive veganism is), it is conceivable (if we lived in a world with a lot of naive veganism) that, if we ignored all consequences of these posts/framings/actions except for the two variables Elizabeth mentions, attacking naive veganism through these posts would be at least net-positive (even if still, in my opinion, not the optimal approach). But of course, the situation completely changes when realistically taking into account all consequences.
How might Elizabeth have arrived at this misrepresentation? Well, it is true that at the start of my long text I mention:
And an intervention that reduces the number of transitions that are naive, and decreases the number of transitions far less, would also be net-positive.
It is clear how this short piece of text can be read as implying that the only two important variables are the number of naive transitions and the number of transitions (even though shortly afterwards I make clear these are not the only important variables, most of the text is devoted to discussing this, and I even explicitly say so). But clearly it doesn’t imply that I believe “the damage to people who implement veganism badly is less important to me than the damage to animals caused by eating them”. I was just stating that, in some situations, it can make sense to develop certain kinds of health-focused interventions (to make evident that I’m not saying “one should never talk about vegan nutrition”, which is what Elizabeth was accusing me of doing). And indeed a central part of my position as stated in the text was that interventions are necessary, but of a different shape to Elizabeth’s posts (and I go on to explicitly recommend examples of these shapes). But of course that’s not the same as engaging in a detailed discussion about which damages are most important, or already taking into account all of the more complex consequences that different kinds of interventions can have (which I go on to discuss in more detail in the text).
Misrepresentation 3: Missing counter-arguments and important nuance
Elizabeth explains
There are a few problems here, but the most fundamental is that enacting his desired policy of suppressing public discussion of nutrition issues with plant-exclusive diets will prevent us from getting the information to know if problems are widespread. My post and a commenter’s report on their college group are apparently the first time he’s heard of vegans who didn’t live and breathe B12.
But I can’t trust his math because he’s cut himself off from half the information necessary to do the calculations. How can he estimate the number of vegans harmed or lost due to nutritional issues if he doesn’t let people talk about them in public?
First off, the repeated emphasis on “My post and a commenter’s report...” (when addressing this different point she’s brought up) again makes it sound as if my position relied on a twisted perception of the world in which naive vegans don’t exist. I have already addressed why this is not the case in Misrepresentation 1: Mistaking arguments, but I would like to call attention again to the fact that framing and tone are used to caricature my position, or to make it seem as if I hadn’t explicitly addressed Elizabeth’s point here (and hadn’t done so because of a twisted perception). I already find this mildly misleading, given that I had directly addressed that point, and that the content of the text clearly shows my position doesn’t depend on the non-existence of naive veganism as a community problem. But of course, it’s not clear (in this one particular sentence) where authorial liberties of interpretation should end. Maybe Elizabeth is just trying to psychoanalyze me here, finding the hidden motives for my text (even when the text explicitly states different things). First, I would have preferred this to be flagged more clearly, since the impression I (and probably most readers, who of course won’t read my long comment) get from this paragraph is that my text showcased an obvious-to-all naiveté and didn’t address these points. Second, in Misrepresentation 1: Mistaking arguments I have argued why these hidden motives are not real (and again, that is clear from the content of the long text).
Now on to Elizabeth’s main point. In my response to Elizabeth’s response to my long text (sent several days before the post’s publication), I addressed some clarifications that Elizabeth had asked for. There, answering directly her claim that (in her immediate experience) the number of “naive vegans turned non-naive” had been much greater than the number of “vegans turned non-vegan” (which, again, my holistic position doesn’t rely on too strongly in quantitative terms), I said:
The negative effects of the kind “everyone treats veganism less seriously, and as a result less people transition or are vocal about it” will be much more diffused, hard-to-track, and not-observed, than the positive effects of the kind “this concrete individual started vegan supplements”. Indeed, I fear you might be down-playing how easy it is for people to arrive (more or less consciously) at these rationalized positions, and that’s of course based on my anecdotal experience both inside and outside this community.
Thus, to her claim that I have “cut myself off from half the information”, I had already pre-emptively responded by noting that (in my opinion) she has cut herself off from the other half of the information, by ignoring this kind of more diluted effect (which, according to my position, has the biggest impact on the third and most important variable of social dynamics, epistemics, and ethical seriousness). Again, it is also clear in this excerpt that I am worrying more about “diluted effects on social dynamics” than about “the exact figure of how widespread naive veganism is”.
Indeed (and making a more general diagnosis of the misrepresentation that has happened here), I think Elizabeth hasn’t correctly understood that my holistic position, as represented in those texts (and demonstrated in the excerpts presented above), put forward a more general argument, not limited to short-term interventions against naive veganism, nor sensitively relying on how widespread naive veganism is.
Elizabeth understands me as saying “we should ignore naive veganism”. And then, of course, the bigger naive veganism is, the bigger a mistake I might have been making. But in reality my arguments and worries are about framing and tone, and comparing different interventions based on all of their consequences, including the “non-perfectly-epistemic” consequences of undesirably exacerbating this or that dynamic. Here’s an excerpt of my original long text exemplifying that:
As must be clear, I’d be very happy with treating the root causes, related to the internalized optimizy and obsessive mindset, instead of the single symptom of naive vegan transitions. This is an enormously complex issue, but I a priori think available health and wellbeing resources, and their continuous establishment as a resource that should be used by most people (as an easy route to having that part of life under control and not spiraling, similar to how “food on weekdays” is solved for us by our employers), would provide the individualization and nuance that these problems require.
Even more clearly, here is one excerpt where I mention I’m okay with running clinical trials to get whatever information we might need to better navigate this situation:
Something like running small group analytics on some naive vegans as an excuse for them to start thinking more seriously about their health? Yes, nice! That’s individualized, that’s immediately useful. But additionally extracting some low-confidence conclusions and using them to broadcast the above message (or a message that will get first-order approximated to that by 75%) seems negative.
The above makes clear that my worry is not about obtaining or making available that information. It is about the framing and tone of Elizabeth’s message, and the consequences it will have when naively broadcast (without accounting for a part of reality: social dynamics).
Finally, Elizabeth says my desired policy is “suppressing public discussion”. Of course, that’s already a value judgement, and it’s tricky to debate what counts as “suppressing public discussion”, and what counts as “acknowledging the existence of social dynamics, and not shooting yourself in the foot by doing something that seems bad when taking them into account”. I’m confident that my explanations and excerpts above satisfactorily argue that I have been advocating for the latter, not the former. But then again, as with the hidden motives mentioned above, arriving at different conclusions than I do about this (about the nature of what I have written) is not misrepresentation, just an opinion.
What I do find worrisome is how this opinion has been presented and broadcast (so, again, framing). If my position had been represented more transparently, or if Elizabeth had given up on trying to represent it faithfully in a short text and had instead mentioned explicitly that her interpretation of that text was that I was trying to suppress public discussion (even though I had explicitly addressed public discussion and when and how it might be net-positive), then maybe it would have been easier for the average reader to notice that there might be an important difference of interpretations going on here, and that they shouldn’t update so hard on her interpretation as if I had explicitly said “we shouldn’t publicly discuss this (under any framing)”. And even then I would worry this over-represented her side of the story (although part of that is unavoidable).
But this was not the case. These interpretations were presented in a shape pretty much indistinguishable from what an explicitly endorsed summary would have been. Her summary looks exactly the same as it would have looked had I not addressed or answered the points she brings up in any way, or had I explicitly stated the claims and attitudes she attributes to me.
In summary, although I do think this third misrepresentation is less explicitly evident than the other two (because it mixes with Elizabeth’s interpretation of things), I don’t think her opinions have been presented in a way that is well-calibrated about what I was and wasn’t saying, and I think this has led the average reader to, together with Elizabeth, strongly misrepresent my central positions.
Thank you for reading this wall of overly explicit text.
Here’s a link to the recording.
Here’s also a link to a rough report with more details about our WIP.
Since this hypothesis makes distinct predictions, it is possible for the confidence to rise above 50% after finitely many observations.
I was confused about why this is the case. I now think I’ve got an answer (please, anyone, confirm):
The description length of the Turing Machine enumerating theorems of PA is constant. The description length of any Turing Machine that enumerates theorems of PA up until time-step n and then does something else grows with n (for big enough n). Since any probability prior over Turing Machines has an implicit simplicity bias, no matter what prior we have, for big enough n the latter Turing Machines will (jointly) get arbitrarily low probability relative to the first one. Thus, after enough time-steps, given that all observations are PA theorems, our listener will assign arbitrarily higher probability to the first one than to all the rest, and thus the first one will be over 50%.
Edit: Okay, I now saw you mention the “getting over 50%” problem further down:
I don’t know if the argument works out exactly as I sketched; it’s possible that the rich hypothesis assumption needs to be “and also positive weight on a particular enumeration”. Given that, we can argue: take one such enumeration; as we continue getting observations consistent with that observation, the hypothesis which predicts it loses no weight, and hypotheses which (eventually) predict other things must (eventually) lose weight; so, the updated probability eventually believes that particular enumeration will continue with probability > 1⁄2.
But I think the argument goes through already with the rich hypothesis assumption as initially stated. If the listener has non-zero prior probability on the speaker enumerating theorems of PA, it must have non-zero probability on it doing so in a particular enumeration. (unless our specification of the listener structure doesn’t even consider different enumerations? but I was just thinking of their hypothesis space as different Turing Machines the whole time) And then my argument above goes through, which I think is just your argument + explicitly mentioning the additional required detail about the simplicity prior.
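For what it’s worth, here is one way to write down the “hypotheses which eventually predict other things must eventually lose weight” step fully explicitly (a sketch, under the assumption that the listener’s prior is a countably additive distribution over deterministic machines, with $M^*$ some machine of positive prior mass that produces exactly the observed enumeration):

$$
P(\text{some future output is not a PA theorem} \mid \text{first } n \text{ outputs}) \;\le\; \frac{P(D_n)}{P(M^*) + P(D_n)},
$$

where $D_n$ is the set of machines that reproduce the first $n$ observed outputs but eventually output a non-theorem. The $D_n$ are decreasing and $\bigcap_n D_n = \emptyset$ (a machine reproducing the entire observed sequence never outputs a non-theorem), so by countable additivity $P(D_n) \to 0$ while $P(M^*)$ stays fixed, and the bound drops below $1/2$ after finitely many observations.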
Then there’s the problem of designing the negotiation infrastructure, and in particular allocating bargaining power to the various subagents. They all get a veto, but that still leaves a lot of degrees of freedom in exactly how much the agent pursues the goals of each subagent. For the shutdown use-case, we probably want to allocate most of the bargaining power to the non-shutdown subagent, so that we can see what the system does when mostly optimizing for u_1 (while maintaining the option of shutting down later).
I don’t understand what you mean by “allocating bargaining power”, given that each agent already has true veto power. Regardless of the negotiation mechanism you set up for them (if it’s high-bandwidth enough), or whether the master agent says “I’d like this or that agent to have more power”, each subagent could go “give me my proportional (1/n) slice, or else I will veto everything” (and depending on its prior about how the other agents would respond, this may well seem net-positive to do).
In fact, that’s just the tip of the iceberg of individually rational game-theoretic stuff (that messes with your proposal) they could pull off; see Commitment Races.
You might have already talked about this in the meeting (I couldn’t attend), but here goes
This is around where I have problems. I just can’t quite manage to get myself to see how this quantity is the “slice of marginal utility that coalition promises to player i”, so let me know in the comments if anyone manages to pull it off.
Let’s reason this out for a coalition of 3 members, the simplest case that is not readily understandable (as in your “Alice and Bob” example). We have . We can interpret as the strategic gain obtained (for 1) thanks to this 3-member coalition, that is, a direct product of this exact coalition’s capability for coordination and leverage: one that doesn’t stem from the player’s own potential ( ) nor was already present from subcoalitions (like ). The only way to calculate this exact strategic gain in terms of the is to subtract from all these other gains that were already present. In our case, when we rewrite , we’re only saying that is the supplementary gain missing from the sum if we only took into account the gain from the 1-2 coalition plus the further marginal gain added by being in a coalition with 3 as well, and didn’t consider the further strategic benefits that the 3-member coalition could offer. Or, expressed otherwise, if we took into account the base potential and added the two marginal gains and .
Of course, this is really just saying that , which is justified by your (and Harsanyi’s) previous reasoning, so this might seem like a trivial rearrangement which hasn’t provided new explanatory power. One might hope, as you seem to imply, that we can get a different kind of justification for the formula, for instance by appealing to bargaining equilibria inside the coalition. But I feel like this is nowhere to be found. After all, you have just introduced/justified/defined , and this is completely equivalent to . It’s an uneventful numerical-set-theoretic rearrangement. Not only that, but this last equality is only true in virtue of the “nice coherence properties” justification/definition you have provided for the previous one, and would not necessarily be true in general. So it is evident that any justification for it will be a completely equivalent reformulation of your previous argument. We will be treading water and ultimately need to resort back to your previous justification. We wouldn’t expect a qualitatively different justification for than for , so we shouldn’t expect one in this situation either (although here the trivial rearrangement is slightly less obvious than subtracting , because to prove the equivalence we need to know those equalities hold for every and ).
Of course, the same can be said of the expression for , which is an equivalent rearrangement of that for the : any of its justifications will ultimately stem from the same initial ideas, and applying definitions. It will be the disagreement point for a certain subgame because we have defined it/justified its expression just like that (and then trivially rearranged).
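For reference, the standard recursive form of the Harsanyi dividend (textbook notation, which may differ from the post’s): the dividend of a coalition is its value minus the dividends of all its proper non-empty sub-coalitions, which for three players unfolds by inclusion-exclusion:

$$
d_v(S) = v(S) - \sum_{\emptyset \neq T \subsetneq S} d_v(T),
$$

$$
d_v(\{1,2,3\}) = v(\{1,2,3\}) - v(\{1,2\}) - v(\{1,3\}) - v(\{2,3\}) + v(\{1\}) + v(\{2\}) + v(\{3\}).
$$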
Please do let me know if I have misinterpreted your intentions in some way. After all, you probably weren’t expecting the controversial LessWrong tradition of dissolving the question :-)
Brain-dump on Updatelessness and real agents
Building a Son is just committing to a whole policy for the future. In the formalism where our agent uses probability distributions, and ex interim expected value maximization decides your action… the only way to ensure dynamic stability (for your Son to be identical to you) is to be completely Updateless. That is, to decide something using your current prior, and keep that forever.
Luckily, real agents don’t seem to work like that. We are more of an ensemble of selected-for heuristics, and it seems that true, scope-sensitive, complete Updatelessness is very unlikely to come out of this process (although we do have local versions of non-true Updatelessness, like retributivism in humans).
In fact, it’s not even exactly clear how I could use my current brain-state to decide something for the whole future. It’s not even well-defined, like when you’re playing a board-game and discover that some move you were planning isn’t allowed by the rules. There are ways to actually give an exhaustive definition, but I suspect the ones most people would intuitively like are (when scrutinized) sneaking in parts of Updatefulness (which I think is the correct move).
More formally, it seems like what real-world agents do is much better-represented by what I call “Slow-learning Policy Selection”. (Abram had a great post about this called “Policy Selection Solves Most Problems”, which I can’t find now.) This is a small agent (short computation time) recommending policies for a big agent to follow in the far future. But the difference with complete Updatelessness is that the small agent also learns (much more slowly than the big one). Thus, if the small agent thinks a policy (like paying up in Counterfactual Mugging) is the right thing to do, the big agent will implement this for a pretty long time. But eventually the small agent might change its mind, and start recommending a different policy. I basically think that all problems not solved by this are unsolvable in principle, due to the unavoidable trade-off between updating and not updating.[1]
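As a purely illustrative sketch (environment names and payoffs invented; this is not Abram’s formalism), the structure looks something like this: the small agent picks whole policies by ex-ante expected value under beliefs it updates only slowly, while the big agent just executes the current recommendation.

```python
# Toy sketch of "Slow-learning Policy Selection" (illustrative only).
# The small agent recommends whole policies by ex-ante EV under its
# slowly-updated beliefs; the big agent executes the recommendation.

PAYOFFS = {  # payoff[environment][policy]; numbers invented
    "env_rewards_commitment": {"commit": 10.0, "best_respond": 0.0},
    "env_rewards_flexibility": {"commit": 2.0, "best_respond": 8.0},
}
POLICIES = ["commit", "best_respond"]

class SmallAgent:
    """Holds beliefs over environments, updated with a small learning rate."""
    def __init__(self, eta=0.01):
        self.eta = eta
        self.belief = {env: 0.5 for env in PAYOFFS}

    def recommend(self):
        # Ex-ante EV of each whole policy under current (slowly updated) beliefs.
        ev = {pol: sum(self.belief[env] * PAYOFFS[env][pol] for env in PAYOFFS)
              for pol in POLICIES}
        return max(ev, key=ev.get)

    def observe(self, true_env):
        # Move only a small step toward the observed evidence.
        for env in self.belief:
            target = 1.0 if env == true_env else 0.0
            self.belief[env] += self.eta * (target - self.belief[env])

def run(true_env="env_rewards_flexibility", episodes=500):
    small, total = SmallAgent(), 0.0
    for _ in range(episodes):
        policy = small.recommend()      # the big agent follows the recommendation
        total += PAYOFFS[true_env][policy]
        small.observe(true_env)         # the small agent learns, but slowly
    return total, small.recommend()

print(run())  # early episodes "commit"; eventually the recommendation flips
```

The point is just the shape: the recommendation stays stable for a while (so commitments can be honored), but the small agent can eventually change its mind.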
This also has consequences for how we expect superintelligences to be. If by their having “vague opinions about the future” we mean a wide, but perfectly rigorous and compartmentalized probability distribution over literally everything that might happen, then yes, the way to maximize EV according to that distribution might be some very concrete, very risky move, like re-writing yourself into an algorithm because you think simulators will reward this, even if you’re not sure how well that algorithm performs in this universe.
But that’s not how abstractions or uncertainty work mechanistically! Abstractions help us efficiently navigate the world thanks to their modular, nested, fuzzy structure. If they had to compartmentalize everything in a rigorous and well-defined way, they’d stop working. When you take into account how abstractions really work, the kind of partial updatefulness we see in the world is what we’d expect. I might write about this soon.
Surprisingly, in some conversations others still wanted to “get both updatelessness and updatefulness at the same time”. Or, receive the gains from Value of Information, and also those from Strategic Updatelessness. Which is what Abram and I had in mind when starting work. And is, when you understand what these words really mean, impossible by definition.
Marginally against legibilizing my own reasoning:
When making important decisions, I spend too much time writing down the many arguments and legibilizing the whole process for myself. This is due to completionist tendencies. Unfortunately, a more legible process doesn’t overwhelmingly imply a better decision!
Scrutinizing your main arguments is necessary, although this looks more like intuitively assessing their robustness in concept-space than making straightforward calculations, given how many implicit assumptions they all have. I can fill in many boxes, and count and weigh considerations in depth, but that’s not a strong signal, and it’s almost never what ends up swaying me towards a decision!
Rather than folding, re-folding and re-playing all of these ideas inside myself, it’s way more effective time-wise to engage my System 1 more: intuitively assess the strength of different considerations, try to brainstorm new ways in which the hidden assumptions fail, try to spot the ways in which the information I’ve received is partial… And of course, share all of this with other minds, who are much more likely to update me than my own mind is. All of this looks more like rapidly racing through intuitions than filling Excel sheets, or having overly detailed scoring systems.
For example, do I really think I can BOTEC the expected counterfactual value (IN FREAKING UTILONS) of a new job position? Of course a bad BOTEC is better than none, but the extent to which that is not how our reasoning works, and the work is not really done by the BOTEC at all, is astounding. Maybe at that point you should stop calling it a BOTEC.
I don’t doubt your anecdotal experience is as you’re telling it, but mine has been completely different, so much so that it sounds crazy to me to spend a whole year being vegan, and participating in animal advocacy, without hearing mention of B12 supplementation. Literally all vegans I’ve met have very prominently stressed the importance of dietary health and B12 supplementation. Heck, even all the vegan shitposts are about B12!
comparing [~vegans who don’t take supplements] to [omnivores who don’t take supplements] will give the clearest data
Even if that might be literally true for scientific purposes (and stressing again that the above project clearly doesn’t have robust scientific evidence as its goal), I don’t think it will be an accurate representation of the picture when presented to the community, since most vegans do supplement [citation needed, but it’s been my extensive personal and online experience, and all famous vegan resources I’ve seen stress this]. You’d thus be comparing the non-average vegan to the average omnivore, giving a false sense of imbalance against veganism. As rational as we might try to be, framing of this kind matters, and we are all especially emotional and visceral with regard to something as intimate and personal as our diet. On average, people raised omnivore have a strong repulsion towards veganism (so strong as to override ethical concerns), and I think we should take that into account.
To me it feels like the natural place to draw the line is update-on-computations but updateless-on-observations.
A first problem with this is that there is no sharp distinction between purely computational (analytic) information/observations and purely empirical (synthetic) information/observations. This is a deep philosophical point, well-known in the analytic philosophy literature, and best represented by Quine’s “Two Dogmas of Empiricism” and his idea of the “Web of Belief”. (This is also related to Radical Probabilism.)
But it’s unclear if this philosophical problem translates to a pragmatic one. So let’s just assume that the laws of physics are such that all superintelligences we care about converge on the same classification of computational vs empirical information.
A second and more worrying problem is that, even given such convergence, it’s not clear all other agents will decide to forego the possible apparent benefits of logical exploitation. It’s a kind of Nash equilibrium selection problem: if I were very sure all other agents forego them (and have robust cooperation mechanisms that deter exploitation), then I would just do like them. And indeed, it’s conceivable that our laws of physics (and algorithmics) are such that this is the case, and all superintelligences converge on the Schelling point of “never exploiting the learning of logical information”. But my probability of that is not very high, especially due to worries that different superintelligences might start with pretty different priors, and make commitments early on (before all posteriors have had time to converge). (That said, my probability is high that almost all deliberation is mostly safe, for more contingent reasons related to the heuristics they use and the values they have.)
You might also want to say something like “they should just use the correct decision theory to converge on the nicest Nash equilibrium!”. But that’s question-begging, because the worry is exactly that others might have different notions of this normative “nice” (indeed, there is no objective criterion for decision theory). The problem recurs: we can’t just invoke a decision theory to decide on the correct decision theory.
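As a toy illustration of the equilibrium selection worry (my own example, not from the original exchange), take the Nash demand game: each agent commits in advance to a demand over a unit pie, and incompatible demands leave both with nothing. Every exact split is a Nash equilibrium, so “use the correct decision theory” underdetermines which one gets played.

```python
# Minimal Nash demand game: the equilibrium selection worry in two functions.
# The numbers are purely illustrative.

PIE = 1.0

def payoffs(demand_a: float, demand_b: float) -> tuple[float, float]:
    """Both get their demand if the demands are compatible, otherwise the bargain fails."""
    if demand_a + demand_b <= PIE:
        return demand_a, demand_b
    return 0.0, 0.0

# Two agents raised on the same notion of fairness coordinate fine;
# two agents with different notions miscoordinate, even though each is
# best-responding to its own prior about the other.
print(payoffs(0.5, 0.5))   # egalitarian vs egalitarian: (0.5, 0.5)
print(payoffs(0.6, 0.5))   # "I deserve more" vs egalitarian: (0.0, 0.0)
```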
Am I missing something about why logical counterfactual muggings are likely to be common?
As mentioned in the post, Counterfactual Mugging as presented won’t be common, but equivalent situations in multi-agentic bargaining might, due to (the naive application of) some priors leading to commitment races. (And here “naive” doesn’t mean “shooting yourself in the foot”, but rather “doing what looks best from the prior”, even if unbeknownst to you it has dangerous consequences.)
if it comes up it seems that an agent that updates on computations can use some precommitment mechanism to take advantage of it
It’s not looking like something as simple as that will solve the problem, because of reasoning as in this paragraph:
Unfortunately, it’s not that easy, and the problem recurs at a higher level: your procedure to decide which information to use will depend on all the information, and so you will already lose strategicness. Or, if it doesn’t depend, then you are just being updateless, not using the information in any way.
Or in other words, you need to decide on the precommitment ex ante, when you still haven’t thought much about anything, so your precommitment might be bad.
(Although to be fair there are ongoing discussions about this.)
Hi Elizabeth, I feel like what I wrote in those long comments has been strongly misrepresented in your short explanations of my position in this post, and I kindly ask for a removal of those parts of the post until this has been cleared up (especially since I had in the past offered to provide opinions on the write-up). Sadly I only have 10 minutes to engage now, but here are some object-level ways in which you’ve misrepresented my position:
The charitable explanation here is that my post focuses on naive veganism, and Soto thinks that’s a made-up problem.
Of course, my position is not as hyperbolic as this.
his desired policy of suppressing public discussion of nutrition issues with plant-exclusive diets will prevent us from getting the information to know if problems are widespread
In my original answers I address why this is not the case (private communication serves this purpose more naturally).
I have a lot of respect for Soto for doing the math and so clearly stating his position that “the damage to people who implement veganism badly is less important to me than the damage to animals caused by eating them”.
As I mentioned many times in my answer, that’s not the (only) trade-off I’m making here. More concretely, I consider the effects of these interventions on community dynamics and epistemics possibly even worse (due to future actions the community might or might not take) than the suffering experienced by the farmed animals currently murdered for members of our community to consume.
I can’t trust his math because he’s cut himself off from half the information necessary to do the calculations. How can he estimate the number of vegans harmed or lost due to nutritional issues if he doesn’t let people talk about them in public?
Again, I addressed this in my answers, and argued that data of the kind you will obtain are still not enough to derive the conclusions you were deriving.
More generally, my concerns were about framing, and about how much posts like this one can affect sensible advocacy and the ethical backbone of this community. There is indeed a trade-off here between transparent communication and communal dynamics, but that happens in all communities, and ignoring it in ours is wishful thinking. It seems like none of my worries have been incorporated into the composition of this post, in which you have just doubled down on the framing. I think these worries could have been presented in a much healthier form without incurring all of those framing costs, and I think its publication is net-negative because of them.
Off-topic, but I just noticed I’m reading your book for my Computational Complexity master’s course. And you’re posting here on alignment! Such a positive shock! :)
But yeah assume we have the meta-level thing. It’s not that the cognition of the system is mysteriously failing; it’s that it is knowingly averse to deception and to thinking about how it can ‘get around’ or otherwise undermine this aversion.
[...]
What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to “OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way.”
Maybe a part of what Nate is trying to say is:
Ensuring that the meta-level thing works robustly, or ensuring that the AI always runs that sort of very general and conscious routine, is already as hard as making the AI actually care about and uphold some of our concepts, that is, value alignment. Because of the multi-purpose nature and complexity of cognitive tools, any internal metric for deception, or ad hoc procedure searching for deception that doesn’t actually use the full cognitive power and high-level abstract reasoning of the AI, or internal procedure “specified completely once and for all at some point during training” (that is, one that doesn’t become more thorough and general as the AI does), will at some point break. The only way for this not to happen is for the AI to completely, consciously and with all its available tools constantly struggle to uphold the conceptual boundary. And this is already just an instance of “the AI pursuing some goal”, only now the goal, instead of being pointed outwards at the world (“maximize red boxes in our galaxy”), is explicitly pointed inwards and reflectively at the AI’s own mechanism (“minimize the amount of such cognitive processes happening inside you”). And (according to Nate) pointing an AI in such a general and deep way towards any goal (this one included) is the very hard thing we don’t know how to do, “the whole problem”.
I still have hope that the ‘robust non-deceptiveness’ thing I’ve been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.
I take it you’re trying to express that you’re doubtful of that last opinion of (my model of) Nate’s: you think this goal is so natural that in fact it will be easier to point at than most of the other goals we’d need for solving alignment (like low impact or human values).
Another coarse, on-priors consideration that I could have added to the “Other lenses” section:
Eliezer says something like “surely superintelligences will be intelligent enough to coordinate on Pareto-optimality (and not fall into something like commitment races), and easily enact logical or value handshakes”. But here’s why I think this outside-view consideration need not hold. It is a generally good heuristic to think superintelligences will be able to solve tasks that seem impossible to us. But I think this stops being the case for tasks whose difficulty / complexity grows with the size / computational power / intelligence level of the superintelligence. For a task like “beating a human at Go” or “turning the solar system into computronium”, the difficulty of the task is constant (relative to the size of the superintelligence you’re using to solve it). For a task like “beat a copy of yourself at Go”, that’s clearly not the case (well, unless Go has a winning strategy that a program within our universe can enact, which would put a ceiling on the difficulty). I claim “ensuring Pareto-optimality” is more like the latter. When the intelligence or compute of all players grows, it is true that they can find more clever and sure-fire ways to coordinate robustly, but it’s also true that they can individually find more clever ways of tricking the system and getting a bit more of the pie (and in some situations, they are individually incentivized to do this). Of course, one might still hold that the former will grow much faster than the latter, and so, after a certain level of intelligence, agents of a similar intelligence level will easily coordinate. But that’s an additional assumption, relative to the “constant-difficulty” cases.
Of course, if Eliezer believes this, it is not really because of outside-view considerations like the above, but because of his inside views on decision theory. But I generally disagree with his takes there (for example here), and have never found convincing arguments (from him or anyone else) for the easy coordination of superintelligences.