I like how you think, but I don’t think it’s entirely driven by survivorship bias—experimental evidence shows that people are more motivated to access information when it’s suppressed than when it’s accessible (a phenomenon called psychological reactance).
Thank you for your thoughtful response.
It makes the same kind of sense as still planning for a business-as-usual 10-20 year future.
Agreed. I don’t know whether that approach to planning makes sense either, though. Given a high (say 90%)[1] p(doom) in the short term, would a rational actor change how they live their life? I’d think yes, in some ways that are easier to accept (assigning a higher priority to short-term pleasure, maybe rethinking effortful long-term projects that involve significant net suffering in the short term) as well as some less savoury ways that would probably be irresponsible to post online but would be consistent with a hedonistic utilitarian approach (i.e., prioritizing minimizing suffering).
- ^
Choosing 90% because that’s what I would confidently bet on—I recognize many people in the community would assign a lower probability to existential catastrophe at this time.
I understand that. I’m not sure I understand your point here, though—wouldn’t it still be an arguably poor use of effort to sign up for cryonics if likely outcomes ranged from (1) an increasingly unlikely chance of people being revived, at best, to (2) being revived by a superintelligence with goals hostile to those of humanity, at worst?
How does cryonics make sense in the age of high x-risk? As p(doom) increases, cryonics seems like a worse bet because (1) most x-risk scenarios would result in frozen people/infrastructure/the world being destroyed and (2) revival/uploading would be increasingly likely to be performed by a misaligned ASI hoping to use humans for its own purposes (such as trade). Can someone help me understand what I’m missing here or clarify how cryonics advocates think about this?
This is my first quick take—feedback welcome!
Ignoring the serious ethical issues inherent in manipulating people’s emotions for instrumental gain, this strategy seems highly (I’d say 95%+) likely to backfire. Intergroup relations research shows that strong us-vs-them dynamics lead to radicalization and loss of control of social movements, and the motivated reasoning literature demonstrates that identity-defining beliefs inhibit evidence-based reasoning. Moreover, even if this somehow worked, cultivating hatred of Silicon Valley and Big Tech would likely lead to the persecution of EY-types and other AI safety researchers with the most valuable insights on the matter.
Would someone who legitimately, deeply believes lack of diversity of perspective would be catastrophic, and who values avoiding that catastrophe and thus will in fact take rapid, highest-priority action to get as close as possible to democratically constructed values and collectively rational insight, be able to avoid this problem?
No, I don’t think anyone could, barring the highly unlikely case of a superintelligence perfectly aligned with human values (and even still, maybe not—human values are inconsistent and contradictory). Also, I think a system of democratically constructed values would probably be at odds with rational insight, unfortunately.
Regarding the rest, agreed. Heading into verboten-ish political territory here, but see also Jenny Holzer and Chomsky on this.
Maybe the result of one person’s clones forming a very capable Em Collective would still be suboptimal and undemocratic from the perspective of the rest of humanity, but it wouldn’t kill everyone, and I think wouldn’t lead to especially bad outcomes if you start from the right person.
I think the risk of a homogeneous collective of many instances of a single person’s consciousness is more serious than “suboptimal and undemocratic” suggests. Even assuming you could find a perfectly well-intentioned person to clone, identical minds share the same blind spots and biases. Without diversity of perspective, even earnestly benevolent ideas could—and I imagine would—lead to unintended catastrophe.
I also wonder how you would identify the right person, as I can’t think of anyone I would trust with that degree of power.
I agree that “psychosis” is probably not a great term for this. “Mania” feels closer to what the typical case is like. It would be nice to have an actual psychiatrist weigh in.
I’m a clinical psychology PhD student. Please take the following as psychoeducation, and with the strong caveats that (1) I cannot diagnose without supervision as I am not an independently licensed practitioner and (2) I would not have enough information to diagnose here even if I were.
Mania and psychosis are not mutually exclusive. Under the DSM-5-TR, Bipolar I has two relevant specifiers: (a) with mood-congruent psychotic features, and (b) with mood-incongruent psychotic features. The symptoms described here would be characterized as mood-congruent (though this does not imply the individual would meet criteria for Bipolar I or any other mental disorder).
Of course, differential diagnosis for any sort of “AI psychosis” case would involve considering multiple schizophrenia spectrum and other psychotic disorders as well, alongside a thorough consideration of substance use, medication history, physical health, life circumstances, etc. to determine which diagnosis—if any—provides the most parsimonious explanation for the described symptoms. Like most classification schemes, diagnostic categories are imperfect and useful to the extent that they serve their function in a given context.
Hi all! My name is Annabelle. I’m a Clinical Psychology PhD student primarily studying psychosocial correlates of substance use using intensive longitudinal designs. Other research areas include Borderline Personality Disorder, stigma and prejudice, and Self-Determination Theory.
I happened upon LessWrong while looking into AI alignment research and am impressed with the quality of discussion here. While I lack some relevant domain knowledge, I am eagerly working through the Sequences and have downloaded/accessed some computer science and machine learning textbooks to get started.
I have questions for LW veterans and would appreciate guidance on where best to ask them. Here are two:
(1) Has anyone documented how to successfully reduce—and maintain reduced—sycophancy over multi-turn conversations with LLMs? I learn through Socratic questioning, but models seem to “interpret” this as seeking validation and become increasingly agreeable in response. When I try to correct for this (using prompts like “think critically and anticipate counterarguments,” “maintain multiple hypotheses throughout and assign probabilities to them,” and “assume I will detect sycophancy”), I’ve found models overcorrect and become excessively critical/contrarian without engaging in improved critical thinking.[1] I understand this is an inherent problem of RLHF.
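For concreteness, here is a minimal sketch of how one might pin such instructions in place via a system prompt—so they apply to every turn rather than being repeated (and diluted) in user messages—using the Anthropic Python SDK. The model name and the prompt wording here are illustrative placeholders, not a tested recipe:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Illustrative anti-sycophancy instructions, held constant across the whole conversation.
SYSTEM_PROMPT = (
    "Engage in Socratic dialogue. Do not treat the user's questions as requests for "
    "validation. Maintain multiple hypotheses, state your credence in each, and note "
    "the strongest counterargument to your current position before concluding."
)

history = []  # full multi-turn history, resent with every call


def ask(user_message: str) -> str:
    """Send one conversational turn and append both sides to the running history."""
    history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```

Even with instructions like these held fixed across turns, I have still seen the overcorrection described above, which is why I suspect the issue runs deeper than prompting.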
(2) Have there been any recent discussions about navigating practical life decisions under the assumption of a high p(doom)? I’ve read Eliezer’s death with dignity piece and reviewed the alignment problem mental health resources, but am interested in how others are behaviourally updating on this. It seems bunkers and excessive planning are probably futile and/or a poor use of time and effort, but are people reconsidering demanding careers? To what extent is a delayed gratification approach less compelling nowadays, and how might this impact financial management? Do people have any sort of low-cost, short-term-suffering-mitigation plans in place for the possible window between the emergence of ASI and people dying/getting uploaded/other horrors?
I have many more questions about a broad range of topics—many of which do not concern AI—and would appreciate guidance about norms re: how many questions are appropriate per comment, whether there are better venues for specific sorts of questions, etc.
Thank you, and I look forward to engaging with the community!
- ^
I find Claude’s models are more prone to sycophancy/contrarianism than Perplexity’s Sonar model, but I would prefer to use Claude as it seems superior in other important respects.
Note that this “childhood self” does not seem to be particularly based on anything endogenous to the user (who has barely provided any details thus far, though it’s possible more details are saved in memory), but is instead mythologized by ChatGPT in a long exercise in creative writing. The user even abdicates his side of the interaction with it to the AI (at the AI’s suggestion).
Indeed, these parasitic LLM instances appear to rely almost exclusively on Barnum statements to “hook” users. Cue second-order AI psychosis from AI-generated horoscopes...
Thank you for conducting this research! This paper led me to develop an interest in AI alignment and discover LessWrong.
In this post and in the paper, you mention you do not address the case of adversarial deception. To my lay understanding, adversarial deception refers to models autonomously feigning alignment during training, whereas the covert actions investigated in this paper reflect models feigning alignment during evaluation following prior training to do so. This leads me to wonder how distinct the mechanisms models use to scheme during testing are from those required for scheming during training.
Without any domain knowledge, I would assume situational awareness and consequent behaviour modification would likewise be required for adversarial deception. If this is true, could one make the argument that your findings reflect early evidence of key capabilities required for adversarial deception? More abstractly, to what extent could covert actions—if they emerged organically rather than being trained in as in your study—be understood as adversarial deception lite?
I think you’re headed in the right direction, yes: people can only experience psychological reactance when they are aware that information is being suppressed, and most information suppression is successful (in that the information is suppressed and the suppression attempt is covert). In the instances where the suppression attempt is overt, a number of factors determine whether the “Streisand Effect” occurs (the novelty/importance of the information, the number of people who notice the suppression attempt, the traits/values/interests/influence of the people that notice the suppression attempt, whether it’s a slow news day, etc.). I think survivorship bias is relevant to the extent that it leads people to overestimate how often the Streisand Effect occurs in response to attempts to suppress information. Does that sound about right to you?