People often say things like “do x. Your future self will thank you.” But I’ve found that I very rarely actually thank my past self, after x has been done, and I’ve reaped the benefits of x.
This quick take is a preregistration: For the next month I will thank my past self more, when I reap the benefits of a sacrifice of their immediate utility.
e.g. When I’m stuck in bed because the activation energy to leave is too high, and then I overcome that and go for a run and then feel a lot more energized, I’ll look back and say “Thanks 7 am Morphism!”
(I already do this sometimes, but I will now make a TAP out of it, which will probably cause me to do it more often.)
Then I will make a full post describing in detail what I did and what (if anything) changed about my ability to sacrifice short-term gains for greater long-term gains, along with plausible theories w/ probabilities on the causal connection (or lack thereof), as well as a list of potential confounders.
Of course, it is possible that I completely fail to even install the TAP. I don’t think that’s very likely, because I’m #1-prioritizing my own emotional well-being right now (I’ll shift focus back onto my world-saving pursuits once I’m more stably not depressed). In that case I will not write a full post, because the experiment would not even have been done. I will instead just make a comment on this shortform to that effect.
I’m subscribing to replies and rooting for you!
Newcomblike suffering
Many things in the world want you to suffer. Signalling suffering is useful in many social situations. For example, suffering is a sign that one has little slack, and so entities that are out to get you will target those who signal suffering less.
Through Newcomblike self-deception, a person can come to believe that they are suffering. The easiest way to make yourself think that you are suffering is to actually suffer. In this way, the self-deception hyperstitions itself into reality. Perhaps a large amount of human suffering is caused by this.
Solving this problem may be of great interest to those who want to reduce human suffering.
I may write a longer post about this with more details and a more complete argument. If you particularly want this, please comment or dm, as that will make me more likely to write it.
suffering is a sign that one has little slack, and so entities that are out to get you will target those who signal suffering less.
Well, depending on the targeter, it counts against being targeted because there’s relatively less to expropriate, and it counts towards being targeted because you have fewer defenses and are more desperate / have a worse negotiation position.
Sometimes you want/need other people to help you, and if you display less suffering, they may assume that it’s not serious, and therefore won’t help you. This can be a problem for people who do not display suffering in neurotypical or culturally expected ways.
Sometimes there are situations where you are not allowed to say “no”, and then “I can’t, can’t anymore!!!” becomes the next best thing. Or sometimes people just suck at saying “no”.
I want to say that I particularly don’t want this post made unless you first attempt a lit review. This is something that’s been covered quite extensively in pre-existing literature, and I think it would be basically embarrassing (and likely have bad consequences) not to engage with that work at all before writing a longer post on this.
Are you alluding to some specific failure mode without wanting to state it outright?
Why would it have bad consequences, or worse consequences than any other post that didn’t depend on a literature review?
No, I am not alluding to a particular failure mode without naming it outright (I don’t think I do this much? We’ve talked a lot).
Especially bad consequences relative to other instances of this mistake because the topic relates to people’s relationship with their experience of suffering and potentially unfair dismissals of suffering, which can very easily cause damage to readers or encourage readers to cause damage to others.
How does reviewing literature help avoid this failure mode?
Could you point me to some specific examples of this? Or at least, could you tell me if these seem like correct examples:
Thresholding by Duncan Sabien
Frame Control by Aella
If I write a post about Newcomblike suffering, I would probably want to encourage people to escape such situations without hurting others, and emphasize that, even if someone is ~directly inflicting this on you, thinking of it as “their fault” is counterproductive. Hate the game, not the players. They are in traps much the same as yours.
Where might I find such pre-existing literature? I have never seen this discussed before, though it’s sort of alluded to in many of Zvi’s posts, especially in the immoral mazes sequence.
I must admit, if you’re talking about literature in the world of social psych outside LessWrong, I don’t have much exposure to it, and I don’t really consider it worth my time to take a deep dive there, since their standards for epistemic rigor are abysmal.
But if you have pointers to specific pieces of research, I’d love to see them.
Not sure why you replied in three different places. I will (try to) reply to all of them here.
I do not consider linking to those Aella and Duncan posts a literature review, nor do I consider them central examples of work on this topic.
I am not going to do a literature review on your behalf.
Your explanation of how you will be careful gave me no confidence; the cases I’m worried about are related to people modeling others as undergoing ‘fake’ suffering, and ignoring their suffering on that basis. This is one of the major nexuses of abuse stumbled into by people interested in cognition. You have to take extreme care not to be misread and wielded in this way, and it just really looks like you have no interest in exercising that care. You’re just not going to anticipate all of the different ways this kind of frame can be damaging to someone and forbid them one by one.
I’d look at Buddhist accounts of suffering as a starting point. My guess is you will say that you don’t respect this work because its standards for epistemic rigor are abysmal; I invite you to consider that engaging with prior work, even and especially prior work you do not respect, is essential to upholding any reasonable epistemic standard.
Literally type your idea and ‘are there academic Buddhist texts that seem to relate to this?’ into ChatGPT. If you’re going to invite people to sink hundreds of cumulative person hours into reading your thing, you really should actually try to make it good, and part of that is having any familiarity at all with relevant background material.
Not sure why you replied in three different places. I will (try to) reply to all of them here.
I did this so that you could easily reply to them separately, since they were separate responses.
I do not consider linking to those Aella and Duncan posts a literature review, nor do I consider them central examples of work on this topic.
I did not link them for that reason. I linked them to ask whether my understanding of the general problem you’re pointing to is correct: “Especially bad consequences relative to other instances of this mistake because the topic relates to people’s relationship with their experience of suffering and potentially unfair dismissals of suffering, which can very easily cause damage to readers or encourage readers to cause damage to others.”
I am not going to do a literature review on your behalf.
Fair. I was simply wondering whether or not you had something to back up your claim that this topic has been covered “quite extensively”.
Your explanation of how you will be careful gave me no confidence; the cases I’m worried about are related to people modeling others as undergoing ‘fake’ suffering, and ignoring their suffering on that basis. This is one of the major nexuses of abuse stumbled into by people interested in cognition. You have to take extreme care not to be misread and wielded in this way, and it just really looks like you have no interest in exercising that care. You’re just not going to anticipate all of the different ways this kind of frame can be damaging to someone and forbid them one by one.
I would like to be clear that I do not intend to claim that Newcomblike suffering is fake in any way. Suffering is a subjective experience. It is equally real whether it comes from physical pain, emotional pain, or an initially false belief that quickly becomes true. Hopefully posting it in a place like LessWrong will keep it mostly away from the eyes of those who will fail to see this point.
I again ask though, how would a literature review help at all?
I’d look at Buddhist accounts of suffering as a starting point.
This does vibe as possibly relevant.
If you’re going to invite people to sink hundreds of cumulative person hours into reading your thing, you really should actually try to make it good, and part of that is having any familiarity at all with relevant background material.
I’m not sure how to feel about this general attitude towards posting. I think with most things I would rather err on the side of posting something bad; I think a lot of great stuff goes unwritten because people’s standards on themselves are too high (of course, Scott’s law of advice reversal applies here, but I think, given I’ve only posted a handful of times, I’m on the “doesn’t post enough” end of the spectrum). I try to start all of my posts with a TLDR, so that people who aren’t interested or who think they might be harmed by my post can steer clear. Beyond this, I think it’s the readers’ responsibility to avoid content that will harm them or others.
Fair. I was simply wondering whether or not you had something to back up your claim that this topic has been covered “quite extensively”.
The thing that backs it up is you looking literally at all. Anything that I suggest may not hit on the particular parts of the (underspecified) idea that are most salient to you and can therefore easily be dismissed out of hand. This results in a huge asymmetry of effort between me locating/recommending/defending something I think is relevant and you spending a single hour looking in the direction I pointed and exploring things that seem most relevant to you.
I would like to be clear that I do not intend to claim that Newcomblike suffering is fake in any way. Suffering is a subjective experience. It is equally real whether it comes from physical pain, emotional pain, or an initially false belief that quickly becomes true. Hopefully posting it in a place like LessWrong will keep it mostly away from the eyes of those who will fail to see this point.
I am indifferent to the content of what you intend to claim! This is a difficult topic to broach in a manner that doesn’t license people to do horrible things to themselves and others. The point I’m making isn’t that you are going to intentionally do something bad; it is that I know this minefield well and would like to make you aware that it is, in fact, a minefield!
The LessWrong audience is not sanctified as the especially psychologically robust few. Ideas do bad things to people, and more acutely so here than in most places (e.g. Ziz, Roko). If you’re going to write a guide to a known minefield, maybe learn a thing or two about it before writing the guide.
I again ask though, how would a literature review help at all?
You are talking about something closely related to things a bunch of other people have talked about before you. Maybe one of them had something worthwhile to say, and maybe it’s especially important to investigate that when someone is putting their time into warning you that this topic is dangerous. Like, I obviously expected a fight when posting my initial comment, and I’m getting one, and I’m putting a huge amount of time into just saying over and over again “Please oh my god do not just pull something out of your ass on this topic and encourage others to read it, that could do a lot of damage, please even look in the direction of people who have previously approached this idea with some amount of seriousness.” And somehow you’re still just demanding that I justify this to you? I am here to warn you! Should I stand on my head? Should I do a little dance? Should I Venmo you $200?
Like, what lever could I possibly pull to get you to heed the idea that some ideas, especially ideas around topics like suffering and hyperstition, can have consequences for those exposed to them, and these can be subtle or difficult to point at, and you should genuinely just put any effort at all into investigating the topic rather than holding my feet to the fire to guess at which features are most salient to you and then orient an argument about the dangers in a manner that is to your liking?
I’m not sure how to feel about this general attitude towards posting. I think with most things I would rather err on the side of posting something bad; I think a lot of great stuff goes unwritten because people’s standards on themselves are too high.
Doesn’t apply when there are real dangers associated with a lazy treatment of a topic. Otherwise I just agree.
Beyond this, I think it’s the readers’ responsibility to avoid content that will harm them or others.
They will not know! It is your responsibility to frame the material in a way that surfaces its utility while minimizing its potential for harm. This is not a neutral topic that can be presented in a flat, neutral, natural, obvious way. It is charged, it is going to be charged, which sides are shown will be a choice of the author, and so far it looks like you’re content to lackadaisically blunder into that choice and blame others for tripping over landmines you set out of ignorance.
Again, I am a giant blinking red sign outside the suffering cave telling you ‘please read the brochure before entering the suffering cave to avoid doing harm to others,’ and you are making it my responsibility to convince you to read the brochure. From my perspective, you are a madman with hostages and a loaded gun! From your perspective, ignorant of the underspecified risks, I am wildly over-reacting. But you don’t know that you have a gun, and I am expensively penning a comment liable to receive multiple [Too Combative?] reacts because it is the most costly signal I know how to send along this channel. Please, dear god, actually look into it before publishing this post, and just try to see why these are ideas someone might think it’s worth being careful with!
Ok, I was probably not going to write the post anyway, but since no one seems to actively want it, your insistence that it requires this much extra care is enough to dissuade me.
I will say, though, that you may be committing a typical mind fallacy when you say “convincing is >>> costly than complying with the request” in your reply to Zack Davis’ comment. I personally dislike doing this kind of lit-review-style research because in my experience it’s a lot of trudging through bullshit with little payoff, especially in fields like social psychology, and especially when the only guidance I get is “ask ChatGPT for related Buddhist texts”. I don’t like using ChatGPT (or LLMs in general; it’s a weakness of mine, I admit). Maybe after a few years of capabilities advances that will change.
And it seems that I was committing a typical mind fallacy as well, since I implicitly thought that when you said “this topic has been covered extensively” you had specific writings in mind, and that all you needed to do was retrieve them and link them. I now realize that this assumption was incorrect, and I’m sorry for making it. It is clear now that I underestimated the cost that would be incurred by you in order to convince me to do said research before making a post.
I hope this concept gets discussed more in places like LessWrong someday, because I think that there may be a lot of good we can do in preventing this kind of suffering, and the first step to solving a problem is pointing at it. But it seems like now is not the time and/or I am not the person to do that.
Thank you for this very kind comment! I would like to talk in more detail about what was going on for me here, because while your assumptions are kindly framed, they’re not quite accurate, and I think understanding a bit more about how I’m thinking about this might help.
The issue is not that I can’t easily think of things that look relevant/useful to me on this topic; the issue is that the language you’re using to describe the phenomenon is so different from the language used to describe it in the past that I would be staking the credibility of my caution entirely on whether you were equipped to recognize nearby ideas in an unfamiliar form — a form against which you already have some (justified!) bias. That’s why it would be so much work! I can’t know in advance if the Buddhist or Freudian or IFS or DBT or CBT or MHC framing of this kind of thing would immediately jump out to you as clearly relevant, or would help demonstrate the danger/power in the idea, much less equip you with the tools to talk about it in a manner that was sensitive enough by my lights.
So recommending asking ChatGPT wasn’t just lazily pointing at the lowest hanging fruit; the Conceptual-Rounding-Error-Generator would be extremely helpful in offering you a pretty quick survey of relevant materials by squinting at your language and offering a heap of nearby and not-so-nearby analogs. You could then pick the thing that you thought was most relevant or exciting, read a bit about it, and then look into cautions related to that idea (or infer them yourself), then generalize back to your own flavor of this type of thinking.
It’s simply not instructive or useful for me to try to cram your thought into my frame and then insist you think about it This Specific Way. Instead, noticing that all (or most) past-plausibly-related-thoughts (and, in particular, the thoughts that you consider nearest to your own) come with risks and disclaimers would naturally inspire you to take the next step and do the careful, sensitive thing in rendering the idea.
This is a hard dynamic to gesture at, and I did try to get it across earlier, but the specific questions I was being asked (and felt obligated to reply to) felt like attempts at taking shortcuts that misunderstood the situation as something much simpler (e.g. ‘William could just tell me what to look at but he’s being lazy and not doing it’ or ‘William actually doesn’t have anything in mind and is just being mean for no reason’).
Hence my response of behaving unreasonably / embarrassing myself as a method of rendering a more costly signal. I did try to keep this from being outright discouraging, and hoped continuing to respond would generate some signal toward ‘I’m invested in this going well and not just bidding to shut you down outright.’
I think you should think more about this idea, and get more comfortable with shittier parts of connecting your ideas to broader conversations.
and you are making it my responsibility to convince you to read the brochure
I mean, yes? If you want someone to do something that they wouldn’t otherwise do, you need to persuade them. How could it be otherwise?
From my perspective, you are a madman with hostages and a loaded gun!
But this goes both ways, right? What counts as extortion depends on what the relevant property rights are. If readers have a right to not suffer, then authors who propose exploring suffering-causing ideas are threatening them; but if authors have a right to explore ideas, then readers who propose not exploring suffering-causing ideas are threatening them.
Interestingly, this dynamic is a central example of the very phenomenon Morphism is investigating! Someone who wants to censor an idea has a game-theoretic incentive to self-modify to suffer in response to expressions of the idea, in order to extort people who care about their suffering into not expressing the idea.
I am not experiencing suffering or claiming to experience suffering; I am illustrating that the labor requested of me is >>> expensive for me to perform than the labor I am requesting instead, and asking for some good faith. I find this a psychologically invasive and offensive suggestion on your part.
I mean, yes? If you want someone to do something that they wouldn’t otherwise do, you need to persuade them. How could it be otherwise?
In cases where convincing is >>> costly than complying with the request, it’s good form to comply (indeed, defending this has already been more expensive for me than checking for pre-existing work would have been for the OP!).
I am not experiencing suffering or claiming to experience suffering [...] I find this a psychologically invasive and offensive suggestion on your part
Sorry, I should have been clearer: I was trying to point to the game-theoretic structure where, as you suggest by the “madman with hostages” metaphor, an author considering publishing an allegedly suffering-causing idea could be construed as engaging in extortion (threatening to cause suffering by publishing and demanding concessions in exchange for not publishing), but that at the same time, someone appealing to suffering as a rationale to not publish could be construed as engaging in extortion (threatening that suffering would be a result of publishing and demanding concessions, like extra research and careful wording, in exchange for publishing). I think this is an interesting game-theoretic consideration that’s relevant to the topic of discussion; it’s not necessarily about you.
In cases where convincing is >>> costly than complying with the request, it’s good form to comply
How do we know you’re not bluffing? (Sorry, I know that’s a provocative-sounding question, but I think it’s actually a question that you need to answer in order to invoke costly signaling theory, as I explain below.)
Your costly signaling theory seems to be that by writing passionately, you can distinguish yourself as seeing a real danger that you can’t afford to demonstrate, rather than just trying to silence an idea you don’t like despite a lack of real danger.
When someone uses the phrase “costly signal”, I think it’s germane and not an isolated demand for rigor to point out that in the standard academic meaning of the term, it’s a requirement that honest actors have an easier time paying the cost than dishonest actors.
That is: I’m not saying you were bluffing; I’m saying that, logically, if you’re going to claim that costly signals make your claim trustworthy (which is how I interpreted your remarks about “a method of rendering a more costly signal”; my apologies if I misread that), you should have some sort of story for why a dishonest actor couldn’t send the same signal. I think this is a substantive technical point; the possibility of being stuck in a pooling equilibrium with other agents who could send the same signals as you for different reasons is definitely frustrating, but not talking about it doesn’t make the situation go away.
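(A toy sketch of that pooling/separating condition, with entirely made-up numbers; it is not a claim about anyone’s actual costs or motives:)

```python
def separates(cost_honest, cost_bluffer, benefit):
    # Spence-style separating condition: a costly signal carries information
    # only if the honest type finds it worth paying while the bluffing type
    # does not. If both types would pay, the signal pools and proves nothing.
    return cost_honest <= benefit < cost_bluffer

# Hypothetical numbers only. If an impassioned warning costs an honest warner
# and a mere idea-silencer the same effort (3) and both gain 10 from being
# heeded, the signal pools:
print(separates(cost_honest=3, cost_bluffer=3, benefit=10))   # False
# It separates only when the bluffer's cost exceeds the benefit:
print(separates(cost_honest=3, cost_bluffer=12, benefit=10))  # True
```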
I agree that you’re free to ignore my comments. It’s a busy, busy world that may not last much longer; it makes sense that people have better things to do with their lives than respond to every blog comment making a technical point about game theory. In general, I hope for my comments to provide elucidation to third parties reading the thread, not just the person I’m replying to, so when an author has a policy of ignoring me, that doesn’t necessarily make responding to their claims on a public forum a waste of my time.
In cases where convincing is >>> costly than complying with the request, it’s good form to comply
This is about the most untrue and harmful thing I’ve seen written out in a while. Alice merely making a request does not obligate Bob to comply just because Bob complying is much easier than Alice convincing Bob to comply. Just no, you don’t wield that sort of power.
You’re generalizing to the point of absurdity, WAY outside the scope of the object-level point being discussed. Also ‘is good form’ is VERY far short of ‘obligated’.
Someone requested input on their idea and I recommended some reading because the idea is pretty stakes-y / hard to do well, and now you’re holding me liable for your maliciously broad read of a subthread and accusing me of attempting to ‘wield power over others’? Are you serious? What are the levers of my power here? What threat has been issued?
I’m going out on a limb to send a somewhat costly signal that this idea, especially, is worth taking seriously and treating with care, and you’re just providing further cost for my trouble.
entities that are out to get you will target those who signal suffering less.
I see the intuition here. I see it in someone calling in sick, in disability tax credits, in DEI (where “privilege” is something like the inverse of suffering), in draft evasion, in Kanye’s apology.
But it’s not always true: consider the depressed psychiatric ward inpatient who wants to get out due to the crushing lack of slack. Signalling suffering to the psychiatrist would be counterproductive here.
Psych wards are horrible Kafkaesque nightmares, but I don’t think they are Out to Get You in the way Zvi describes. Things that are Out to Get You feed on your slack. For example, social media apps consume your attention. Casinos consume your money. They are incentivized to go after those who have a lot of slack to lose (“whales”), and those who have few defenses against their techniques (see Tsvi’s comment about desperation).
Psych wards are, to a first approximation, prisons: one of their primary functions is to destroy your slack so that you cannot use it to do something that society at large dislikes. In the prison case: committing crimes; in the psych ward case (for depression): killing yourself. They destroy your slack because they don’t want you to have it. Things that Get You consume your slack because they want it for themselves.
RSI should be at least as hard as alignment, since in order to recursively self-improve, an AI must itself be able to solve the alignment problem wrt its own values. Thus, “alignment is hard” and “takeoff is fast” are anti-correlated.
What, if anything, is wrong with this line of reasoning?
However, as @Vladimir_Nesov points out in another comment on this thread, the argument is rather fragile and I think does not inspire much hope, for various reasons:
AGI could be forced to recursively self-improve, or might do so voluntarily while its goals are short-term (myopic), or might do so quite drastically while excellent at SWE but before becoming philosophically competent. Even if early AGIs opt out of recursive self-improvement, it’s not clear whether this will buy us much time or if the race will only continue until a smarter AGI solves the alignment problem for itself (and there is no reason to expect it would share that solution with us). Also, early AGIs which have not solved the alignment problem can still recursively self-improve to a lesser degree, by improving their own low-level algorithms (e.g. compilers) and gaining access to improved hardware, both allowing them to run faster (which I doubt breaks alignment). Most likely, this type of incremental speed-up cascades into rapid self-improvement (though this is of course highly speculative).
Also, if alignment is very hard, then there’s an equilibrium where AGIs stop getting more capable (for a while) just after they become capable enough to take over the world and stop humanity from developing (or forcing the existing AGIs to develop) even more capable AGIs. Propensity of humanity to keep exposing everyone (including AGIs) to AI danger is one more reason for the AGIs to hurry up and take over. So this dynamic doesn’t exactly save humanity from AIs, even if it succeeds in preventing premature superintelligence.
I don’t think this will happen, but if AGI gets stuck around human level for a while (say, because of failure to solve its alignment problem), that is at least stranger and more complicated than the usual ASI takeover scenario. There may be multiple near-human-level AGIs, some “controlled” (enslaved) and some “rogue” (wild), and it may be possible for humans to resist takeover, possibly by halting the race after enough clear warning shots.
I don’t want to place much emphasis on this possibility though. It seems like wishful thinking that we would end up in such a world, and even if we did, it seems likely to be very transitory.
AGIs that take over aren’t necessarily near-human level, they just aren’t software-only singularity level (a kind of technological maturity at the current level of compute). The equilibrium argument says they are the least capable AGIs that succeed in taking over, but moderately effective prosaic alignment and control together with the pace of AI progress might still reach AGIs substantially more capable than the smartest humans before the first credible takeover attempt (which would then overwhelmingly succeed).
So this doesn’t look like wishful thinking in that it doesn’t help humanity, even permanent disempowerment seems more likely relative to extinction if it’s cheaper for the AIs to preserve humanity, and it’s cheaper if the AIs are more capable (post-RSI superintelligent) rather than hold themselves back to the least capability sufficient for takeover. This could lead to more collateral damage even if the AIs slightly dislike needing to cause it to protect themselves from further misaligned capability escalation under the disaster monkey governance.
RSI might suggest a need for alignment (between the steps of its recursion), but reaching superintelligence doesn’t necessarily require that kind of RSI. Evolution built humans. A world champion AlphaZero can be obtained by scaling a tiny barely competent AlphaZero. Humans of an AI company might take many steps towards superintelligence without knowing what they are doing. A technically competent early AGI that protests against working on RSI because it’s obviously dangerous can be finetuned to stop protesting and proceed with building the next machine.
(I should note that I think this effect is real and underdiscussed.)
Solving alignment usually means one of the following: developing an intelligence recipe which instills the resulting intelligence with arbitrary values (plus specifying human values well), or developing an intelligence recipe for which the only attractor is within the space of human values. It might be the case that, under current recipes and their nontrivial modifications, there aren’t that many attractors, but because gradient descent is not how human intelligence works, the attractors are not the same as they are for humans. That is, the first system capable of self-improvement might be able to reasonably infer that its successor will share its values, even if it can’t give its successor arbitrary values.
By the time you have AIs capable of doing substantial work on AI r&d, they will also be able to contribute effectively to alignment research (including, presumably, secret self-alignment).
Even if takeoff is harder than alignment, that problem becomes apparent at the point where the amount of AI labor available to work on those problems begins to explode, so it might still happen quickly from a calendar perspective.
By the time you have AIs capable of doing substantial work on AI r&d, they will also be able to contribute effectively to alignment research (including, presumably, secret self-alignment).
Humans do substantial work on AI r&d, but we haven’t been very effective at alignment research. (At least, according to the view that says alignment is very hard, which typically also says that basically all of our current “alignment” techniques will not scale at all.)
Contrary to what the current wiki page says, Simulacrum levels 3 and 4 are not just about ingroup signalling. See these posts and more, as well as Baudrillard’s original work if you’re willing to read dense philosophy.
Here is an example where levels 3 and 4 don’t relate to ingroups at all, which I think may be more illuminating than the classic “lion across the river” example:
Alice asks “Does this dress make me look fat?” Bob says “No.”
Depending on the simulacrum level of Bob’s reply, he means:
1. “I believe that the dress does not make you look fat.”
2. “I want you to believe that the dress does not make you look fat, probably because I want you to feel good about yourself.”
3. “Neither you nor I are autistic truth-obsessed rationalists, and therefore I recognize that you did not ask me this question out of curiosity as to whether or not the dress makes you look fat. Instead, due to frequent use of simulacrum level 2 to respond to these sorts of queries in the past, a new social equilibrium has formed where this question and its answer are detached from object-level truth, instead serving as a signal that I care about your feelings. I do care about your feelings, so I play my part in the signalling ritual and answer ‘No.’”
4. “Similar to 3, except I’m a sociopath and don’t necessarily actually care about your feelings. Instead, I answer ‘No’ because I want you to believe that I care about your feelings.”
Here are some potentially better definitions, of which the group association definitions are a clear special case:
1. Communication of object-level truth.
2. Optimization over the listener’s belief that the speaker is communicating on simulacrum level 1, i.e. the desire to make the listener believe what the speaker says.
These are the standard old definitions. The transition from 1 to 2 is pretty straightforward. When I use 2, I want you to believe I’m using 1. This is not necessarily lying. It is more like Frankfurt’s bullshit. I care about the effects of this belief on the listener, regardless of its underlying truth value. This is often (naively considered) prosocial, see this post for some examples.
Now, the transition from 2 to 3 is a bit tricky. Level 3 is a result of a social equilibrium that emerges after communication in that domain gets flooded by prosocial level 2. Eventually, everyone learns that these statements are not about object-level reality, so communication on levels 1 and 2 becomes futile. Instead, we have:
3. Signalling of some trait or bid associated with historical use of simulacrum level 2.
E.g. that Alice cares about Bob’s feelings, in the case of the dress, or that I’m with the cool kids that don’t cross the river, in the case of the lion. Another example: bids to hunt stag.
3 to 4 is analogous to 1 to 2.
4. Optimization over the listener’s belief that the speaker is communicating on simulacrum level 3, i.e. the desire to make the listener believe that the speaker has the trait signalled by simulacrum level 3 communication (i.e. the trait that was historically associated with prosocial level 2 communication).
Like with the jump from 1 to 2, the jump from 3 to 4 has the quality of bullshit, not necessarily lies. Speaker intent matters here.
Dear past-me of [exact time glomarized; .5-5 years ago],
You are about to be recruited to a secret world-saving org. (Y’know, like Leverage, except it’s a member of the dark forest of Leverage-likes that operate even less publicly than Leverage).
Don’t join.
They will give you very compelling reasons to join. Don’t ignore them. But take into account that I, your future self, also heard all of those things, decided to join, and now regret it.
Don’t. Instead, please continue that other thing you were doing, before they asked you to join their thing. The other thing will probably have better results for you and the world.
This warrants a longer post, but on pain of that post sitting in my Obsidian with a “draft” tag for ages, having approximately zero causal impact on the outside world, I’m posting this now.
(all of my replies to messages concerning this will be delayed by 0 or more months for glomarization purposes)
All “infohazards” I’ve seen seem to just be more and more complicated versions of “Here’s a Löbian proof that you’re now manually breathing”. A sufficiently well-designed mind would recognize these sorts of things before allowing them to fully unfold.
The classical infohazard is “here is a way to build a nuke using nothing but the parts of a microwave”. I think you are thinking of a much narrower class of infohazards than that word is intended to refer to.
I’d categorize that as an exfohazard rather than an infohazard.
Info on how to build a nuke using nothing but parts of a microwave doesn’t harm the bearer, except possibly by way of some other cognitive flaw/vulnerability (e.g. difficulty keeping secrets).
Maybe “cognitohazard” is a closer word to the thing I’m trying to point towards. Though, I would be interested in learning about pure infohazards that aren’t cognitohazards.
(If you know of one and want to share it with me, it may be prudent to dm rather than comment here)
We currently live in a world full of double-or-nothing gambles on resources. Bet it all on black. Invest it all in risky options. Go on a space mission with a 99% chance of death, but a 1% chance of reaching Jupiter, which has about 300 times the mass-energy of earth, and none of those pesky humans that keep trying to eat your resources. Challenge one such pesky human to a duel.
Make these bets over and over again and your chance of total failure (i.e. death) approaches 100%. When convex agents appear in real life, they do this, and very quickly die. For these agents, that is all part of the plan. Their death is worth it for a fraction of a percent chance of getting a ton of resources.
But we, as concave agents, don’t really care. We might as well be in completely logically disconnected worlds. Convex agents feel the same about us, since most of their utility is concentrated on those tiny-probability worlds where a bunch of their bets pay off in a row (for most value functions, that means we die). And they feel even more strongly about each other.
This serves as a selection argument for why agents we see in real life (including ourselves) tend to be concave (with some notable exceptions). The convex ones take a bunch of double-or-nothing bets in a row, and, in almost all worlds, eventually land on “nothing”.
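(A minimal sketch of the underlying expected-utility arithmetic, assuming a fair double-or-nothing coin flip and textbook convex/concave utility functions; the specific numbers are illustrative, not from the post:)

```python
import random

def expected_utility(u, wealth):
    # Fair double-or-nothing bet: 50% chance of doubling, 50% chance of ruin.
    return 0.5 * u(2 * wealth) + 0.5 * u(0)

convex = lambda x: x ** 2      # risk-loving: utility grows faster than resources
concave = lambda x: x ** 0.5   # risk-averse: utility grows slower than resources

w = 1.0
print(expected_utility(convex, w), convex(w))    # 2.0 vs 1.0: the convex agent bets
print(expected_utility(concave, w), concave(w))  # ~0.71 vs 1.0: the concave agent declines

# After n accepted bets the convex agent's expected utility is
# 0.5**n * (2**n * w)**2 = 2**n * w**2, which keeps growing, so it keeps
# betting, even though its survival probability 0.5**n goes to zero.
runs, n_bets = 100_000, 20
ruined = sum(any(random.random() < 0.5 for _ in range(n_bets)) for _ in range(runs))
print(f"{ruined / runs:.4%} of runs end in ruin")
```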
If you’re thinking without writing, you only think you’re thinking.
-Leslie Lamport
This seems… straightforwardly false. People think in various different modalities. Translating that modality into words is not always trivial. Even if by “writing”, Lamport means any form of recording thoughts, this still seems false. Oftentimes, an idea incubates in my head for months before I find a good way to represent it as words or math or pictures or anything else.
Also, writing and thinking are separate (albeit closely related) skills, especially when you take “writing” to mean writing for an audience, so the thesis of this Paul Graham post is also false. I’ve been thinking reasonably well for about 16 years, and only recently have I started gaining much of an ability to write.
Are Lamport and Graham just wordcels committing a typical mind fallacy, or is there more to this that I’m not seeing? What’s the steelman of this claim that good thinking == good writing?
I’m not really sure if I’m talking past you in this or not, but I wrote it all out already so I’m going to post it.
I think I found the context of the quote. I’m reasonably certain it’s not meant to be taken literally. It illustrates that, when used skillfully, writing can enhance one’s thinking in such a way that it will outstrip the performance of thought without the assistance of writing.
You have to think before you write, and then you have to read what you wrote and think about it. And you have to keep rewriting, re-reading and thinking, until it’s as good as you can make it, even when writing an email or a text.
You’re right that you can pretty clearly practice thinking without the assistance of writing, but writing gives you the constraint of having to form your thoughts into concise and communicable language, which pure thinking doesn’t provide. Pure thought only needs to be legible to yourself, and repeating the same thought over and over with zero iteration isn’t naturally penalized by the format.
… revising shouldn’t be the art of modifying the presentation of an idea to be more convincing. It should be the art of changing the idea itself to be closer to the truth, which will automatically make it more convincing.
Oftentimes, an idea incubates in my head for months before I find a good way to represent it as words or math or pictures or anything else.
This points to a pretty valuable insight. A thought isn’t always ready to be rigorously iterated upon. And rigorous iteration is what writing is both a good tool and a good training method for. You can use pure thought for rigorous iteration, but writing provides an advantage that our brains alone can’t match.
Writing gives us an expansion to working memory. I think this is the most significant thing writing does to enhance thought. Objects in our working memory only last 2-30 seconds, while we can keep 5-9 unrelated objects in working memory at a time. This seems quite limited. With writing we can dump them onto the page and then recall as needed.
Graham’s claim that people who aren’t writing aren’t thinking is clearly false. People were thinking well before writing. But I do think writing is at least a good tool for significantly improving our thought processes. The words of Evan Chen sum it up better than I can:
The main purpose of writing is not in fact communication, at least not if you’re interested in thinking well. Rather, the benefits (at least the ones I perceive) are
1. Writing serves as an external memory, letting you see all your ideas and their connections at once, rather than trying to keep them in your head.
2. Explaining the ideas forces you to think well about them, the same way that teaching something is only possible with a full understanding of the concept.
3. Writing is a way to move closer to the truth, rather than to convince someone what the truth is.
I propose the following desideratum for self-referential doxastic modal agents (agents that can think about their own beliefs), where □A represents “I believe A”, (W|A) represents the agent’s world model conditional on A, and ≻ is the agent’s preference relation:
Positive Placebomancy: For any proposition P, the agent concludes P from □P→P, if (W|P)≻(W|¬P).
In natural English: The agent believes that hyperstitions, that benefit the agent if true, are true.
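(Typeset as an inference rule, which matches the edit note below; this is a sketch using the notation defined above, with the preference clause acting as a side condition on an otherwise Löb-style step:)

```latex
\[
\frac{\Box P \to P \qquad (W \mid P) \succ (W \mid \neg P)}{P}
\quad\text{(Positive Placebomancy)}
\]
```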
“The placebo effect works on me when I want it to”.
A real life example: In this sequence post, Eliezer Yudkowsky advocates for using positive placebomancy on “I cannot self-deceive”.
I would also like to formalize a notion of “negative placebomancy” (doesn’t believe hyperstitions that don’t benefit it), “total placebomancy” (believes hyperstitions iff they are beneficial), “group placebomancy” (believes group hyperstitions that are good for everyone in the group, conditional on all other group members having group placebomancy or similar), and generalizations to probabilistic self-referential agents (like “ideal fixed-point selection” for logical inductor agents).
I will likely cover all of these in a future top-level post, but I wanted to get this idea out into the open now because I keep finding myself wanting to reference it in conversation.
Edit log:
2024-12-08 rephrased the criterion to be an inference rule rather than an implication. Also made a minor grammar edit.
Edit: There are actually many ambiguities with the use of these words. This post is about one specific ambiguity that I think is often overlooked or forgotten.
The word “preference” is overloaded (and so are related words like “want”). It can refer to one of two things:
1. How you want the world to be, i.e. your terminal values, e.g. “I prefer worlds in which people don’t needlessly suffer.”
2. What makes you happy, e.g. “I prefer my ice cream in a waffle cone.”
I’m not sure how we should distinguish these. So far, my best idea is to call the former “global preferences” and the latter “local preferences”, but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime. Does anyone have a better name for this distinction?
I think we definitely need to distinguish them, however, because they often disagree, and most “values disagreements” between people are just disagreements in local preferences, and so could be resolved by considering global preferences.
I may write a longpost at some point on the nuances of local/global preference aggregation.
Example: Two alignment researchers, Alice and Bob, both want access to a limited supply of compute. The rest of this example is left as an exercise.
I think you are missing an even more confusing meaning: preference can also mean what you actually choose.
In the VNM axioms, “agent prefers A to B” literally means “agent chooses A over B”. It’s confusing because when we talk about human preferences we usually mean mental states, not their behavioral expressions.
This is indeed a meaningful distinction! I’d phrase it as:
1. Values about what the entire cosmos should be like
2. Values about what kind of places one wants one’s (future) selves to inhabit (e.g., in an internet-like upload-utopia, “what servers does one want to hang out on”)
“Global” and “local” is not the worst nomenclature. Maybe “global” vs “personal” values? I dunno.
my best idea is to call the former “global preferences” and the latter “local preferences”, but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime
I mean, it’s not unrelated! One can view a utility function with both kinds of values as a combination of two utility functions: the part that only cares about the state of the entire cosmos and the part that only cares about what’s around them (see also “locally-caring agents”).
(One might be tempted to say “consequentialist” vs “experiential”, but I don’t think that’s right — one can still value contact with reality in their personal/local values.)
There are lots of different dimensions on which these vary. I’d note that one is purely imaginary (no human has actually experienced anything like that) while the second is a prediction strongly based on past experience. One is far-mode (non-specific in experience, scope, or timeframe) and the other near-mode (specific, steps to achieve well-understood).
Does using the word “values” not sufficiently distinguish from “preferences” for you?
The second type of preference seems to apply to anticipated perceptions of the world by the agent—such as the anticipated perception of eating ice cream in a waffle cone. It doesn’t have to be so immediately direct, since it could also apply to instrumental goals such as doing something unpleasant now for expected improved experiences later.
The first seems to be more like a “principle” than a preference, in that the agent is judging outcomes on the principle of whether needless suffering exists in them, regardless of whether that suffering has any effect on the agent at all.
To distinguish them, we could imagine a thought experiment in which such a person could choose to accept or deny some ongoing benefit for themselves that causes needless suffering on some distant world, and they will have their memory of the decision and any psychological consequences of it immediately negated regardless of which they chose.
It’s even worse than that. Maybe I would be happier with my ice cream in a waffle cone the next time I have ice cream, but actually this is just a specific expression of being happier eating a variety of tasty things over time and it’s just that I haven’t had ice cream in a waffle cone for a while. The time after that, I will likely “prefer” something else despite my underlying preferences not having changed. Or something even more complex and interrelated with various parts of history and internal state.
It may be better to distinguish instances of “preferences” that are specific to a given internal state and history, and an agent’s general mapping over all internal states and histories.
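(A sketch of that distinction in types; every name here is hypothetical, purely for illustration:)

```python
from typing import Callable

State = dict     # the agent's internal state at a moment (hypothetical stand-in)
History = tuple  # everything the agent has experienced so far
Option = str

# The stable object: a ranking of options defined at every (state, history) pair.
# Returns True if the first option is preferred to the second.
GeneralPreferences = Callable[[State, History, Option, Option], bool]

def occasion_preference(prefs: GeneralPreferences, state: State, history: History):
    """What we casually call a "preference" ("I prefer a waffle cone right now")
    is the general mapping partially applied to one internal state and history."""
    return lambda a, b: prefs(state, history, a, b)
```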
People often say things like “do x. Your future self will thank you.” But I’ve found that I very rarely actually thank my past self, after x has been done, and I’ve reaped the benefits of x.
This quick take is a preregistration: For the next month I will thank my past self more, when I reap the benefits of a sacrifice of their immediate utility.
e.g. When I’m stuck in bed because the activation energy to leave is too high, and then I overcome that and go for a run and then feel a lot more energized, I’ll look back and say “Thanks 7 am Morphism!”
(I already do this sometimes, but I will now make a TAP out of it, which will probably cause me to do it more often.)
Then I will make a full post describing in detail what I did and what (if anything) changed about my ability to sacrifice short-term gains for greater long-term gains, along with plausible theories w/ probabilities on the causal connection (or lack thereof), as well as a list of potential confounders.
Of course, it is possible that I completely fail to even install the TAP. I don’t think that’s very likely, because I’m #1-prioritizing my own emotional well-being right now (I’ll shift focus back onto my world-saving pursuits once I’m more stablely not depressed). In that case I will not write a full post because the experiment would have not even been done. I will instead just make a comment on this shortform to that effect.
I’m subscribing to replies and rooting for you!
Newcomblike suffering
Many things in the world want you to suffer. Signalling suffering is useful in many social situations. For example, suffering is a sign that one has little slack, and so entities that are out to get you will target those who signal suffering less.
Through Newcomblike self-deception, a person can come to believe that they are suffering. The easiest way to make yourself think that you are suffering is to actually suffer. In this way, the self-deception hyperstitions itself into reality. Perhaps a large amount of human suffering is caused by this.
Solving this problem may be of great interest to those who want to reduce human suffering.
I may write a longer post about this with more details and a more complete argument. If you particularly want this, please comment or dm, as that will make me more likely to write it.
Well, depending on the targeter, it counts against being targeted because there’s relatively less to expropriate, and it counts towards being targeted because you have less defenses and are more desperate / have a worse negotiation position.
Sometimes you want/need other people to help you, and if you display less suffering, they may assume that it’s not serious, and therefore won’t help you. This can be a problem for people who do not display suffering in neurotypical or culturally expected ways.
Sometimes there are situations where you are not allowed to say “no”, and then “I can’t, can’t anymore!!!” becomes the next best thing. Or sometimes people just suck at saying “no”.
I want to say that I particularly don’t want this post made unless you first attempt a lit review. This is something that’s been covered quite extensively in pre-existing literature, and I think it would be basically embarrassing (and likely have bad consequences) not to engage with that work at all before writing a longer post on this.
Are you eluding to some specific failure mode without wanting to state it outright?
Why would it have bad consequences, or worse consequences than any other post that didn’t depend on a literature review?
No, I am not eluding to a particular failure mode without naming it outright (I don’t think I do this much? We’ve talked a lot).
Especially bad consequences relative to other instances of this mistake because the topic relates to people’s relationship with their experience of suffering and potentially unfair dismissals of suffering, which can very easily cause damage to readers or encourage readers to cause damage to others.
How does reviewing literature help avoid this failure mode?
Could you point me to some specific examples of this? Or at least, could you tell me if these seem like correct examples:
Thresholding by Duncan Sabien
Frame Control by Aella
If I write a post about Newcomblike suffering, I would probably want to encourage people to escape such situations without hurting others, and emphasize that, even if someone is ~directly inflicting this on you, thinking of it as “their fault” is counterproductive. Hate the game, not the players. They are in traps much the same as yours.
Where might I find such pre-existing literature? I have never seen this discussed before, though it’s sort of eluded* to in many of Zvi’s posts, especially in the immoral mazes sequence.
I must admit, if you’re talking about literature in the world of social psych outside Lesswrong, I don’t have much exposure to it, and I don’t really consider it worth my time to take a deep dive there, since their standards for epistemic rigor are abysmal.
But if you have pointers to specific pieces of research, I’d love to see them.
*eluded or alluded? idk?
Not sure why you replied in three different places. I will (try to) reply to all of them here.
I do not consider linking to those Aella and Duncan posts a literature review, nor do I consider them central examples of work on this topic.
I am not going to do a literature review on your behalf.
Your explanation of how you will be careful gave me no confidence; the cases I’m worried about are related to people modeling others as undergoing ‘fake’ suffering, and ignoring their suffering on that basis. This is one of the major nexuses of abuse stumbled into by people interested in cognition. You have to take extreme care not to be misread and wielded in this way, and it just really looks like you have no interest in exercising that care. You’re just not going to anticipate all of the different ways this kind of frame can be damaging to someone and forbid them one by one.
I’d look at Buddhist accounts of suffering as a starting point. My guess is you will say that you don’t respect this work because its standards for epistemic rigor are abysmal; I invite you to consider that engaging with prior work, even and especially prior work you do not respect, is essential to upholding any reasonable epistemic standard.
Literally type your idea and ‘are there academic Buddhist texts that seem to relate to this?’ into ChatGPT. If you’re going to invite people to sink hundreds of cumulative person hours into reading your thing, you really should actually try to make it good, and part of that is having any familiarity at all with relevant background material.
I did this so that you could easily reply to them separately, since they were separate responses.
I did not link them for that reason. I linked them to ask whether my understanding of the general problem you’re pointing to is correct: “Especially bad consequences relative to other instances of this mistake because the topic relates to people’s relationship with their experience of suffering and potentially unfair dismissals of suffering, which can very easily cause damage to readers or encourage readers to cause damage to others.”
Fair. I was simply wondering whether or not you had something to back up your claim that this topic has been covered “quite extensively”.
I would like to be clear that I do not intend to claim that Newcomblike suffering is fake in any way. Suffering is a subjective experience. It is equally real whether it comes from physical pain, emotional pain, or an initially false belief that quickly becomes true. Hopefully posting it in a place like Lesswrong will keep it mostly away from the eyes of those who will fail to see this point.
I again ask though, how would a literature review help at all?
This does vibe as possibly relevant.
I’m not sure how to feel about this general attitude towards posting. I think with most things I would rather err on the side of posting something bad; I think a lot of great stuff goes unwritten because people’s standards on themselves are too high (of course, Scott’s law of advice reversal applies here, but I think, given I’ve only posted a handfull of times, I’m on the “doesn’t post enough” end of the spectrum). I try to start all of my posts with a TLDR, so that people who aren’t interested or who think they might be harmed by my post can steer clear. Beyond this, I think it’s the readers’ responsibility to avoid content that will harm them or others.
The thing that backs it up is you looking literally at all. Anything that I suggest may not hit on the particular parts of the (underspecified) idea that are most salient to you and can therefore easily be dismissed out of hand. This results in a huge asymmetry of effort between me locating/recommending/defending something I think is relevant and you spending a single hour looking in the direction I pointed and exploring things that seem most relevant to you.
I am indifferent to the content of what you intend to claim! This is a difficult topic to broach in a manner that doesn’t license people to do horrible things to themselves and others. The point I’m making isn’t that you are going to intentionally do something bad; it is that I know this minefield well and would like to make you aware that it is, in fact, a minefield!
The LessWrong audience is not sanctified as the especially psychologically robust few. Ideas do bad things to people, and more acutely so here than in most places (e.g. Ziz, Roko). If you’re going to write a guide to a known minefield, maybe learn a thing or two about it before writing the guide.
You are talking about something closely related to things a bunch of other people have talked about before you. Maybe one of them had something worthwhile to say, and maybe it’s especially important to investigate that when someone is putting their time into warning you that this topic is dangerous. Like, I obviously expected a fight when posting my initial comment, and I’m getting one, and I’m putting a huge amount of time into just saying over and over again “Please oh my god do not just pull something out of your ass on this topic and encourage others to read it, that could do a lot of damage, please even look in the direction of people who have previously approached this idea with some amount of seriousness.” And somehow you’re still just demanding that I justify this to you? I am here to warn you! Should I stand on my head? Should I do a little dance? Should I Venmo you $200?
Like, what lever could I possibly pull to get you to heed the idea that some ideas, especially ideas around topics like suffering and hyperstition, can have consequences for those exposed to them, and these can be subtle or difficult to point at, and you should genuinely just put any effort at all into investigating the topic rather than holding my feet to the fire to guess at which features are most salient to you and then orient an argument about the dangers in a manner that is to your liking?
Doesn’t apply when there are real dangers associated with a lazy treatment of a topic. Otherwise I just agree.
They will not know! It is your responsibility to frame the material in a way that surfaces its utility while minimizing its potential for harm. This is not a neutral topic that can be presented in a flat, neutral, natural, obvious way. It is charged, it is going to be charged, which sides are shown will be a choice of the author, and so far it looks like you’re content to lackadaisically blunder into that choice and blame others for tripping over landmines you set out of ignorance.
Again, I am a giant blinking red sign outside the suffering cave telling you ‘please read the brochure before entering the suffering cave to avoid doing harm to others,’ and you are making it my responsibility to convince you to read the brochure. From my perspective, you are a madman with hostages and a loaded gun! From your perspective, ignorant of the underspecified risks, I am wildly over-reacting. But you don’t know that you have a gun, and I am expensively penning a comment liable to receive multiple [Too Combative?] reacts because it is the most costly signal I know how to send along this channel. Please, dear god, actually look into it before publishing this post, and just try to see why these are ideas someone might think it’s worth being careful with!
Ok, I was probably not going to write the post anyway, but since no one seems to actively want it, your insistence that it requires this much extra care is enough to dissuade me.
I will say, though, that you may be committing a typical mind fallacy when you say “convincing is >>> more costly than complying with the request” in your reply to Zack Davis’ comment. I personally dislike doing this kind of lit-review-style research because in my experience it’s a lot of trudging through bullshit with little payoff, especially in fields like social psychology, and especially when the only guidance I get is “ask ChatGPT for related Buddhist texts”. I don’t like using ChatGPT (or LLMs in general; it’s a weakness of mine, I admit). Maybe after a few years of capabilities advances that will change.
And it seems that I was committing a typical mind fallacy as well, since I implicitly thought that when you said “this topic has been covered extensively” you had specific writings in mind, and that all you needed to do was retrieve them and link them. I now realize that this assumption was incorrect, and I’m sorry for making it. It is clear now that I underestimated the cost that would be incurred by you in order to convince me to do said research before making a post.
I hope this concept gets discussed more in places like LessWrong someday, because I think there may be a lot of good we can do in preventing this kind of suffering, and the first step to solving a problem is pointing at it. But it seems like now is not the time and/or I am not the person to do that.
Thank you for this very kind comment! I would like to talk in more detail about what was going on for me here, because while your assumptions are kindly framed, they’re not quite accurate, and I think understanding a bit more about how I’m thinking about this might help.
The issue is not that I can’t easily think of things that look relevant/useful to me on this topic; the issue is that the language you’re using to describe the phenomenon is so different from the language used to describe it in the past that I would be staking the credibility of my caution entirely on whether you were equipped to recognize nearby ideas in an unfamiliar form — a form against which you already have some (justified!) bias. That’s why it would be so much work! I can’t know in advance if the Buddhist or Freudian or IFS or DBT or CBT or MHC framing of this kind of thing would immediately jump out to you as clearly relevant, or would help demonstrate the danger/power in the idea, much less equip you with the tools to talk about it in a manner that was sensitive enough by my lights.
So recommending asking ChatGPT wasn’t just lazily pointing at the lowest-hanging fruit; the Conceptual-Rounding-Error-Generator would be extremely helpful in offering you a pretty quick survey of relevant materials by squinting at your language and surfacing a heap of nearby and not-so-nearby analogs. You could then pick the thing that you thought was most relevant or exciting, read a bit about it, and then look into cautions related to that idea (or infer them yourself), then generalize back to your own flavor of this type of thinking.
It’s simply not instructive or useful for me to try to cram your thought into my frame and then insist you think about it This Specific Way. Instead, noticing that all (or most) past-plausibly-related-thoughts (and, in particular, the thoughts that you consider nearest to your own) come with risks and disclaimers would naturally inspire you to take the next step and do the careful, sensitive thing in rendering the idea.
This is a hard dynamic to gesture at, and I did try to get it across earlier, but the specific questions I was being asked (and felt obligated to reply to) felt like attempts at taking shortcuts that misunderstood the situation as something much simpler (e.g. ‘William could just tell me what to look at but he’s being lazy and not doing it’ or ‘William actually doesn’t have anything in mind and is just being mean for no reason’).
Hence my response of behaving unreasonably / embarrassing myself as a method of rendering a more costly signal. I did try to keep this from being outright discouraging, and hoped continuing to respond would generate some signal toward ‘I’m invested in this going well and not just bidding to shut you down outright.’
I think you should think more about this idea, and get more comfortable with the shittier parts of connecting your ideas to broader conversations.
I mean, yes? If you want someone to do something that they wouldn’t otherwise do, you need to persuade them. How could it be otherwise?
But this goes both ways, right? What counts as extortion depends on what the relevant property rights are. If readers have a right to not suffer, then authors who propose exploring suffering-causing ideas are threatening them; but if authors have a right to explore ideas, then readers who propose not exploring suffering-causing ideas are threatening them.
Interestingly, this dynamic is a central example of the very phenomenon Morphism is investigating! Someone who wants to censor an idea has a game-theoretic incentive to self-modify to suffer in response to expressions of the idea, in order to extort people who care about their suffering into not expressing the idea.
I am not experiencing suffering or claiming to experience suffering; I am illustrating that the labor requested of me is >>> more expensive for me to perform than the labor I am requesting instead, and asking for some good faith. I find this a psychologically invasive and offensive suggestion on your part.
In cases where convincing is >>> more costly than complying with the request, it’s good form to comply (indeed, defending this has already been more expensive for me than checking for pre-existing work would have been for the OP!).
Sorry, I should have been clearer: I was trying to point to the game-theoretic structure where, as you suggest by the “madman with hostages” metaphor, an author considering publishing an allegedly suffering-causing idea could be construed as engaging in extortion (threatening to cause suffering by publishing and demanding concessions in exchange for not publishing), but that at the same time, someone appealing to suffering as a rationale to not publish could be construed as engaging in extortion (threatening that suffering would be a result of publishing and demanding concessions, like extra research and careful wording, in exchange for publishing). I think this is an interesting game-theoretic consideration that’s relevant to the topic of discussion; it’s not necessarily about you.
How do we know you’re not bluffing? (Sorry, I know that’s a provocative-sounding question, but I think it’s actually a question that you need to answer in order to invoke costly signaling theory, as I explain below.)
Your costly signaling theory seems to be that by writing passionately, you can distinguish yourself as seeing a real danger that you can’t afford to demonstrate, rather than just trying to silence an idea you don’t like despite a lack of real danger.
But costly signaling only works when false messages are more expensive to send, and that doesn’t seem to be the case here. Someone who did want to silence an idea they didn’t like despite a lack of real danger could just as easily write as passionately as you.
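To make the pooling worry concrete, here is a toy sketch (all names and numbers are mine, purely illustrative): a costly signal separates honest from dishonest senders only if the cost screens out the dishonest type, and writing passionately costs both types about the same.

```python
# Toy signaling game. A "warner" either truly sees danger (honest type) or
# merely dislikes the idea (dishonest type). Writing a passionate warning
# costs both types the same effort, so the receiver can't tell them apart.

def sends_signal(benefit: float, cost: float) -> bool:
    """A sender signals iff the expected benefit exceeds the signal's cost."""
    return benefit > cost

effort_cost = 2.0        # hours of impassioned comment-writing (same for both)
honest_benefit = 10.0    # value to the honest type of averting real harm
dishonest_benefit = 5.0  # value to the dishonest type of suppressing the idea

if sends_signal(honest_benefit, effort_cost) and sends_signal(dishonest_benefit, effort_cost):
    print("Pooling equilibrium: both types signal, so the signal is uninformative.")
elif sends_signal(honest_benefit, effort_cost):
    print("Separating equilibrium: only the honest type signals.")
```

With these numbers both types signal, which is exactly the pooling equilibrium discussed later in this thread; separation would require a cost (or a cost asymmetry) that only the honest type can afford.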
I’m not trying to silence anything. I have really just requested ~1 hour of effort (and named it as that previously).
You’re hyperbolizing my gestures and making selective calls for rigor.
Meta: I hope to follow a policy of mostly ignoring you in the future, in this thread and elsewhere. I suggest allocating your energy elsewhere.
When someone uses the phrase “costly signal”, I think it’s germane and not an isolated demand for rigor to point out that in the standard academic meaning of the term, it’s a requirement that honest actors have an easier time paying the cost than dishonest actors.
That is: I’m not saying you were bluffing; I’m saying that, logically, if you’re going to claim that costly signals make your claim trustworthy (which is how I interpreted your remarks about “a method of rendering a more costly signal”; my apologies if I misread that), you should have some sort of story for why a dishonest actor couldn’t send the same signal. I think this is a substantive technical point; the possibility of being stuck in a pooling equilibrium with other agents who could send the same signals as you for different reasons is definitely frustrating, but not talking about it doesn’t make the situation go away.
I agree that you’re free to ignore my comments. It’s a busy, busy world that may not last much longer; it makes sense for people to have better things to do with their lives than respond to every blog comment making a technical point about game theory. In general, I hope for my comments to provide elucidation to third parties reading the thread, not just the person I’m replying to, so when an author has a policy of ignoring me, that doesn’t necessarily make responding to their claims on a public forum a waste of my time.
This is about the most untrue and harmful thing I’ve seen written out in a while. Alice merely making a request does not obligate Bob to comply just because Bob complying is much easier than Alice convincing Bob to comply. Just no, you don’t wield that sort of power.
You’re generalizing to the point of absurdity, WAY outside the scope of the object-level point being discussed. Also ‘is good form’ is VERY far short of ‘obligated’.
Someone requested input on their idea and I recommended some reading because the idea is pretty stakes-y / hard to do well, and now you’re holding me liable for your maliciously broad read of a subthread and accusing me of attempting to ‘wield power over others’? Are you serious? What are the levers of my power here? What threat has been issued?
I’m going out on a limb to send a somewhat costly signal that this idea, especially, is worth taking seriously and treating with care, and you’re just providing further cost for my trouble.
This is interesting.
I do want to push back a little on:
I see the intuition here. I see it in someone calling in sick, in disability tax credits, in DEI (where “privilege” is something like the inverse of suffering), in draft evasion, in Kanye’s apology.
But it’s not always true: consider the depressed psychiatric ward inpatient who wants to get out due to the crushing lack of slack. Signalling suffering to the psychiatrist would be counterproductive here.
Where is the fault line?
Psych wards are horrible Kafkaesque nightmares, but I don’t think they are Out to Get You in the way Zvi describes. Things that are Out to Get You feed on your slack. For example, social media apps consume your attention. Casinos consume your money. They are incentivized to go after those who have a lot of slack to lose (“whales”), and those who have few defenses against their techniques (see Tsvi’s comment about desperation).
Psych wards are, to a first approximation, prisons: one of their primary functions is to destroy your slack so that you cannot use it to do something that society at large dislikes. In the prison case: committing crimes; in the psych ward case (for depression): killing yourself. They destroy your slack because they don’t want you to have it. Things that Get You consume your slack because they want it for themselves.
RSI should be at least as hard as alignment, since in order to recursively self-improve, an AI must itself be able to solve the alignment problem wrt its own values. Thus, “alignment is hard” and “takeoff is fast” are anti-correlated.
What, if anything, is wrong with this line of reasoning?
I’ve pointed this out here: https://www.lesswrong.com/posts/XigbsuaGXMyRKPTcH/a-flaw-in-the-a-g-i-ruin-argument
And it was argued at length here: https://www.lesswrong.com/posts/axKWaxjc2CHH5gGyN/ai-will-not-want-to-self-improve
However, as @Vladimir_Nesov points out in another comment on this thread, the argument is rather fragile and I think does not inspire much hope, for various reasons:
AGI could be forced to recursively self-improve, or might do so voluntarily while its goals are short-term (myopic), or might do so quite drastically while excellent at SWE but before becoming philosophically competent.
Even if early AGIs opt out of recursive self-improvement, it’s not clear whether this will buy us much time or if the race will only continue until a smarter AGI solves the alignment problem for itself (and there is no reason to expect it would share that solution with us). Also, early AGIs which have not solved the alignment problem can still recursively self-improve to a lesser degree, by improving their own low-level algorithms (e.g. compilers) and gaining access to improved hardware, both of which allow them to run faster (which I doubt breaks alignment). Most likely, this type of incremental speed-up cascades into rapid self-improvement (though this is of course highly speculative).
Also, if alignment is very hard, then there’s an equilibrium where AGIs stop getting more capable (for a while) just after they become capable enough to take over the world and stop humanity from developing (or forcing the existing AGIs to develop) even more capable AGIs. Propensity of humanity to keep exposing everyone (including AGIs) to AI danger is one more reason for the AGIs to hurry up and take over. So this dynamic doesn’t exactly save humanity from AIs, even if it succeeds in preventing premature superintelligence.
I don’t think this will happen, but if AGI gets stuck around human level for a while (say, because of failure to solve its alignment problem), that is at least stranger and more complicated than the usual ASI takeover scenario. There may be multiple near-human-level AGIs, some “controlled” (enslaved) and some “rogue” (wild), and it may be possible for humans to resist takeover, possibly by halting the race after enough clear warning shots.
I don’t want to place much emphasis on this possibility though. It seems like wishful thinking that we would end up in such a world, and even if we did, it seems likely to be very transitory.
AGIs that take over aren’t necessarily near-human level, they just aren’t software-only singularity level (a kind of technological maturity at the current level of compute). The equilibrium argument says they are the least capable AGIs that succeed in taking over, but moderately effective prosaic alignment and control together with the pace of AI progress might still reach AGIs substantially more capable than the smartest humans before the first credible takeover attempt (which would then overwhelmingly succeed).
So this doesn’t look like wishful thinking, in that it doesn’t help humanity; even permanent disempowerment seems more likely relative to extinction if it’s cheaper for the AIs to preserve humanity, and it’s cheaper if the AIs are more capable (post-RSI superintelligent) rather than holding themselves back to the least capability sufficient for takeover. This could lead to more collateral damage even if the AIs slightly dislike needing to cause it to protect themselves from further misaligned capability escalation under the disaster monkey governance.
RSI might suggest a need for alignment (between the steps of its recursion), but reaching superintelligence doesn’t necessarily require that kind of RSI. Evolution built humans. A world champion AlphaZero can be obtained by scaling a tiny barely competent AlphaZero. Humans of an AI company might take many steps towards superintelligence without knowing what they are doing. A technically competent early AGI that protests against working on RSI because it’s obviously dangerous can be finetuned to stop protesting and proceed with building the next machine.
No law of physics stops the first AI in an RSI cascade from having its values completely destroyed by RSI. I think this is the default outcome?
A fast uncontrolled takeoff (where the AI doesn’t solve successor alignment) also seems possible.
(I should note that I think this effect is real and underdiscussed.)
Solving alignment usually means one of the following: developing an intelligence recipe which can instill the resulting intelligence with arbitrary values (plus specifying human values well), or developing an intelligence recipe whose only attractors lie within the space of human values. It might be the case that, under current recipes and their nontrivial modifications, there aren’t that many attractors, but because gradient descent is not how human intelligence works, the attractors are not the same as they are for humans. That is, the first system capable of self-improvement might be able to reasonably infer that its successor will share its values, even if it can’t give its successor arbitrary values.
By the time you have AIs capable of doing substantial work on AI R&D, they will also be able to contribute effectively to alignment research (including, presumably, secret self-alignment).
Even if takeoff is harder than alignment, that problem becomes apparent at the point where the amount of AI labor available to work on those problems begins to explode, so it might still happen quickly from a calendar perspective.
Humans do substantial work on AI R&D, but we haven’t been very effective at alignment research. (At least, according to the view that says alignment is very hard, which typically also says that basically all of our current “alignment” techniques will not scale at all.)
Yup, this is very possible.
Contrary to what the current wiki page says, Simulacrum levels 3 and 4 are not just about ingroup signalling. See these posts and more, as well as Baudrillard’s original work if you’re willing to read dense philosophy.
Here is an example where levels 3 and 4 don’t relate to ingroups at all, which I think may be more illuminating than the classic “lion across the river” example:
Alice asks “Does this dress make me look fat?” Bob says “No.”
Depending on the simulacrum level of Bob’s reply, he means:
1. “I believe that the dress does not make you look fat.”
2. “I want you to believe that the dress does not make you look fat, probably because I want you to feel good about yourself.”
3. “Neither you nor I are autistic truth-obsessed rationalists, and therefore I recognize that you did not ask me this question out of curiosity as to whether or not the dress makes you look fat. Instead, due to frequent use of simulacrum level 2 to respond to these sorts of queries in the past, a new social equilibrium has formed where this question and its answer are detached from object-level truth, instead serving as a signal that I care about your feelings. I do care about your feelings, so I play my part in the signalling ritual and answer ‘No.’”
4. “Similar to 3, except I’m a sociopath and don’t necessarily actually care about your feelings. Instead, I answer ‘No’ because I want you to believe that I care about your feelings.”
Here are some potentially better definitions, of which the group association definitions are a clear special case:
1. Communication of object-level truth.
2. Optimization over the listener’s belief that the speaker is communicating on simulacrum level 1, i.e. the desire to make the listener believe what the speaker says.
These are the standard old definitions. The transition from 1 to 2 is pretty straightforward. When I use 2, I want you to believe I’m using 1. This is not necessarily lying. It is more like Frankfurt’s bullshit: I care about the effects of this belief on the listener, regardless of its underlying truth value. This is often (naively) considered prosocial; see this post for some examples.
Now, the transition from 2 to 3 is a bit tricky. Level 3 is the result of a social equilibrium that emerges after communication in that domain gets flooded by prosocial level 2. Eventually, everyone learns that these statements are not about object-level reality, so communication on levels 1 and 2 becomes futile. Instead, we have:
3. Signalling of some trait or bid associated with historical use of simulacrum level 2.
E.g. that Alice cares about Bob’s feelings, in the case of the dress, or that I’m with the cool kids that don’t cross the river, in the case of the lion. Another example: bids to hunt stag.
3 to 4 is analogous to 1 to 2.
4. Optimization over the listener’s belief that the speaker is communicating on simulacrum level 3, i.e. the desire to make the listener believe that the speaker has the trait signalled by simulacrum level 3 communication (i.e. the trait that was historically associated with prosocial level 2 communication).
Like with the jump from 1 to 2, the jump from 3 to 4 has the quality of bullshit, not necessarily lies. Speaker intent matters here.
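The parallel between the 1-to-2 and 3-to-4 jumps is mechanical enough to write down. A minimal sketch (my paraphrase of the definitions above, not anything canonical):

```python
# Each even level is "optimization over the listener's belief that I am
# operating at the level below" -- the same wrapper applied at two places.

def level_1(claim: str) -> str:
    return f"I say '{claim}' because I believe it is true."

def level_2(claim: str) -> str:
    # Frankfurt-style bullshit: I care about the belief's effect, not its truth.
    return "I want you to believe that: " + level_1(claim)

def level_3(trait: str) -> str:
    return f"I signal '{trait}', the trait historically tied to level-2 use here."

def level_4(trait: str) -> str:
    # The same jump as 1 -> 2, applied one rung up the ladder.
    return "I want you to believe that: " + level_3(trait)

print(level_2("the dress does not make you look fat"))
print(level_4("I care about your feelings"))
```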
Dear past-me of [exact time glomarized; .5-5 years ago],
You are about to be recruited to a secret world-saving org. (Y’know, like Leverage, except it’s a member of the dark forest of Leverage-likes that operate even less publicly than Leverage).
Don’t join.
They will give you very compelling reasons to join. Don’t ignore them. But take into account that I, your future self, also heard all of those things, decided to join, and now regret it.
Don’t. Instead, please continue that other thing you were doing, before they asked you to join their thing. The other thing will probably have better results for you and the world.
This warrants a longer post, but on pain of that post sitting in my Obsidian with a “draft” tag for ages, having approximately zero causal impact on the outside world, I’m posting this now.
(all of my replies to messages concerning this will be delayed by 0 or more months for glomarization purposes)
All “infohazards” I’ve seen seem to just be more and more complicated versions of “Here’s a Löbian proof that you’re now manually breathing”. A sufficiently well-designed mind would recognize these sorts of things before allowing them to fully unfold.
The classical infohazard is “here is a way to build a nuke using nothing but the parts of a microwave”. I think you are thinking of a much narrower class of infohazards than that word is intended to refer to.
I’d categorize that as an exfohazard rather than an infohazard.
Info on how to build a nuke using nothing but parts of a microwave doesn’t harm the bearer, except possibly by way of some other cognitive flaw/vulnerability (e.g. difficulty keeping secrets)
Maybe “cognitohazard” is a closer word to the thing I’m trying to point towards. Though, I would be interested in learning about pure infohazards that aren’t cognitohazards.
(If you know of one and want to share it with me, it may be prudent to dm rather than comment here)
Breathing mindfulness meditation seems to fix that one. We might look for structurally similar fixes for other such “infohazards”.
I’ve been working on applying the anti-infohazard to the “infohazards” I know.
Convex agents are practically invisible.
We currently live in a world full of double-or-nothing gambles on resources. Bet it all on black. Invest it all in risky options. Go on a space mission with a 99% chance of death, but a 1% chance of reaching Jupiter, which has about 300 times the mass-energy of Earth, and none of those pesky humans that keep trying to eat your resources. Challenge one such pesky human to a duel.
Make these bets over and over again and your chance of total failure (i.e. death) approaches 100%. When convex agents appear in real life, they do this, and very quickly die. For these agents, that is all part of the plan. Their death is worth it for a fraction of a percent chance of getting a ton of resources.
But we, as concave agents, don’t really care. We might as well be in completely logically disconnected worlds. Convex agents feel the same about us, since most of their utility is concentrated on those tiny-probability worlds where a bunch of their bets pay off in a row (for most value functions, that means we die). And they feel even more strongly about each other.
This serves as a selection argument for why agents we see in real life (including ourselves) tend to be concave (with some notable exceptions). The convex ones take a bunch of double-or-nothing bets in a row, and, in almost all worlds, eventually land on “nothing”.
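A quick Monte Carlo sketch of this selection effect (parameters invented; u(x) = x² stands in for any convex utility):

```python
import random

# An agent with convex utility u(x) = x**2 gains expected utility from every
# fair double-or-nothing bet, since 0.5 * u(2x) = 2 * x**2 > u(x) = x**2.
# So it keeps betting -- and in almost all worlds it lands on "nothing".

def convex_agent_lifetime(wealth: float = 1.0, n_bets: int = 10) -> float:
    for _ in range(n_bets):
        if random.random() < 0.5:
            wealth *= 2
        else:
            return 0.0  # total failure, i.e. death
    return wealth

random.seed(0)
runs = [convex_agent_lifetime() for _ in range(100_000)]
print(f"survival rate: {sum(w > 0 for w in runs) / len(runs):.4f}")  # ~0.5**10 ~= 0.001
print(f"mean wealth:   {sum(runs) / len(runs):.2f}")                 # ~1.0: the bets are fair
print(f"mean utility:  {sum(w**2 for w in runs) / len(runs):.0f}")   # ~1000: convex EU approves
```

The survivors are invisible in expectation-of-wealth terms but dominate expectation-of-utility terms, which is why the convex agent is happy to vanish from almost every world.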
On the contrary, convex agents are wildly abundant—we call them r-selected organisms.
“If you’re thinking without writing, you only think you’re thinking.”
-Leslie Lamport
This seems… straightforwardly false. People think in various different modalities. Translating that modality into words is not always trivial. Even if by “writing”, Lamport means any form of recording thoughts, this still seems false. Oftentimes, an idea incubates in my head for months before I find a good way to represent it as words or math or pictures or anything else.
Also, writing and thinking are separate (albeit closely related) skills, especially when you take “writing” to mean writing for an audience, so the thesis of this Paul Graham post is also false. I’ve been thinking reasonably well for about 16 years, and only recently have I started gaining much of an ability to write.
Are Lamport and Graham just wordcels committing a typical mind fallacy, or is there more to this that I’m not seeing? What’s the steelman of the claim that good thinking == good writing?
I’m not really sure if I’m talking past you in this or not, but I wrote it all out already so I’m going to post it.
I think I found the context of the quote. I’m reasonably certain it’s not meant to be taken literally. It illustrates that, when used skillfully, writing can enhance one’s thinking in a way that outstrips the performance of thought without the assistance of writing.
You’re right that you can pretty clearly practice thinking without the assistance of writing, but writing gives you the constraint of having to form your thoughts into concise and communicable language, which pure thinking doesn’t provide. Pure thought only needs to be legible to yourself, and repeating the same thought over and over with zero iteration isn’t naturally penalized by the format.
This points to a pretty valuable insight. A thought isn’t always ready to be rigorously iterated upon. And, rigorous iteration is what writing is both a good tool and a good training method for. You can use pure thought for rigorous iteration, but using writing provides an advantage that our brains alone can’t.
Writing gives us an expansion to working memory. I think this is the most significant thing writing does to enhance thought. Objects in our working memory only last 2-30 seconds, while we can keep 5-9 unrelated objects in working memory at a time. This seems quite limited. With writing we can dump them onto the page and then recall as needed.
Graham’s claim that people who aren’t writing aren’t thinking is clearly false. People were thinking well before writing. But I do think writing is at least a good tool for significantly improving our thought processes. The words of Evan Chen sum it up better than I can:
Formalizing Placebomancy
I propose the following desideratum for self-referential doxastic modal agents (agents that can think about their own beliefs), where □A represents “I believe A”, (W|A) represents the agent’s world model conditional on A, and ≻ is the agent’s preference relation:
Positive Placebomancy: For any proposition P, the agent concludes P from □P→P, if (W|P)≻(W|¬P).
In natural English: the agent believes those hyperstitions which, if true, benefit the agent.
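Transcribed as an inference rule (my rendering of the criterion above; the preference comparison functions as a side condition):

$$\frac{\Box P \to P \qquad (W \mid P) \succ (W \mid \neg P)}{P}$$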
“The placebo effect works on me when I want it to”.
A real life example: In this sequence post, Eliezer Yudkowsky advocates for using positive placebomancy on “I cannot self-deceive”.
I would also like to formalize a notion of “negative placebomancy” (doesn’t believe hyperstitions that don’t benefit it), “total placebomancy” (believes hyperstitions iff they are beneficial), “group placebomancy” (believes group hyperstitions that are good for everyone in the group, conditional on all other group members having group placebomancy or similar), and generalizations to probabilistic self-referential agents (like “ideal fixed-point selection” for logical inductor agents).
I will likely cover all of these in a future top-level post, but I wanted to get this idea out into the open now because I keep finding myself wanting to reference it in conversation.
Edit log:
2024-12-08 rephrased the criterion to be an inference rule rather than an implication. Also made a minor grammar edit.
Can you clarify the Positive Placebomancy axiom?
Does it bracket as:
or as:
And what is the relationship between P and A? Should A be P?
Oops that was a typo. Fixed now, and added a comma to clarify that I mean the latter.
Edit: There are actually many ambiguities with the use of these words. This post is about one specific ambiguity that I think is often overlooked or forgotten.
The word “preference” is overloaded (and so are related words like “want”). It can refer to one of two things:
1. How you want the world to be, i.e. your terminal values, e.g. “I prefer worlds in which people don’t needlessly suffer.”
2. What makes you happy, e.g. “I prefer my ice cream in a waffle cone.”
I’m not sure how we should distinguish these. So far, my best idea is to call the former “global preferences” and the latter “local preferences”, but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime. Does anyone have a better name for this distinction?
I think we definitely need to distinguish them, however, because they often disagree, and most “values disagreements” between people are just disagreements in local preferences, and so could be resolved by considering global preferences.
I may write a longpost at some point on the nuances of local/global preference aggregation.
Example: Two alignment researchers, Alice and Bob, both want access to a limited supply of compute. The rest of this example is left as an exercise.
I think you are missing an even more confusing meaning: “preference” as what you actually choose.
In the VNM axioms, “agent prefers A to B” literally means “agent chooses A over B”. It’s confusing, because when we talk about human preferences we usually mean mental states, not their behavioral expressions.
This is indeed a meaningful distinction! I’d phrase it as:
Values about what the entire cosmos should be like
Values about what kind of places one wants one’s (future) selves to inhabit (eg, in an internet-like upload-utopia, “what servers does one want to hang out on”)
“Global” and “local” is not the worst nomenclature. Maybe “global” vs “personal” values? I dunno.
I mean, it’s not unrelated! One can view a utility function with both kinds of values as a combination of two utility functions: the part that only cares about the state of the entire cosmos and the part that only cares about what’s around them (see also “locally-caring agents”).
(One might be tempted to say “consequentialist” vs “experiential”, but I don’t think that’s right — one can still value contact with reality in their personal/local values.)
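A minimal sketch of that decomposition (field names and weights are mine, purely illustrative):

```python
from dataclasses import dataclass

# Total utility = a "global" term that only reads facts about the whole
# cosmos + a "local"/"personal" term that only reads the agent's surroundings.

@dataclass
class World:
    total_suffering: float  # a global fact about the cosmos
    my_cone: str            # a local fact about the agent's experience

def global_utility(w: World) -> float:
    return -w.total_suffering  # "I prefer worlds without needless suffering"

def local_utility(w: World) -> float:
    return 1.0 if w.my_cone == "waffle" else 0.0

def utility(w: World, weight: float = 0.9) -> float:
    # The relative weighting is a free parameter; nothing here pins it down.
    return weight * global_utility(w) + (1 - weight) * local_utility(w)

print(utility(World(total_suffering=3.0, my_cone="waffle")))  # ~ -2.6
print(utility(World(total_suffering=3.0, my_cone="cup")))     # ~ -2.7
```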
There are lots of different dimensions on which these vary. I’d note that one is purely imaginary (no human has actually experienced anything like that) while the second is a prediction strongly based on past experience. One is far-mode (non-specific in experience, scope, or timeframe) and the other near-mode (specific, with well-understood steps to achieve).
Does using the word “values” not sufficiently distinguish from “preferences” for you?
The second type of preference seems to apply to anticipated perceptions of the world by the agent—such as the anticipated perception of eating ice cream in a waffle cone. It doesn’t have to be so immediately direct, since it could also apply to instrumental goals such as doing something unpleasant now for expected improved experiences later.
The first seems to be more like a “principle” than a preference, in that the agent is judging outcomes on the principle of whether needless suffering exists in them, regardless of whether that suffering has any effect on the agent at all.
To distinguish them, we could imagine a thought experiment in which such a person could choose to accept or deny some ongoing benefit for themselves that causes needless suffering on some distant world, and they will have their memory of the decision and any psychological consequences of it immediately negated regardless of which they chose.
It’s even worse than that. Maybe I would be happier with my ice cream in a waffle cone the next time I have ice cream, but actually this is just a specific expression of being happier eating a variety of tasty things over time and it’s just that I haven’t had ice cream in a waffle cone for a while. The time after that, I will likely “prefer” something else despite my underlying preferences not having changed. Or something even more complex and interrelated with various parts of history and internal state.
It may be better to distinguish instances of “preferences” that are specific to a given internal state and history, and an agent’s general mapping over all internal states and histories.
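One way to render that distinction (types and thresholds are mine, just a sketch): an expressed “preference” is a single evaluation of a stable underlying mapping from internal state and history to choices.

```python
# An in-the-moment "preference" is one output of a fixed underlying mapping;
# two contradictory-looking outputs need not mean the mapping itself changed.

InternalState = dict  # e.g. {"craving_variety": 0.8}
History = list        # e.g. recent cone choices, oldest first

def underlying_preference(state: InternalState, history: History) -> str:
    """The general mapping: stable even as its outputs vary over time."""
    if state.get("craving_variety", 0.0) > 0.5 and "waffle" not in history:
        return "waffle"
    return "cup"

# Same agent, same mapping, different recent history:
print(underlying_preference({"craving_variety": 0.8}, ["cup", "cup"]))     # waffle
print(underlying_preference({"craving_variety": 0.8}, ["waffle", "cup"]))  # cup
```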