Hi all! My name is Annabelle. I’m a Clinical Psychology PhD student primarily studying psychosocial correlates of substance use using intensive longitudinal designs. Other research areas include Borderline Personality Disorder, stigma and prejudice, and Self-Determination Theory.
I happened upon LessWrong while looking into AI alignment research and am impressed with the quality of discussion here. While I lack some relevant domain knowledge, I am eagerly working through the Sequences and have downloaded/accessed some computer science and machine learning textbooks to get started.
I have questions for LW veterans and would appreciate guidance on where best to ask them. Here are two:
(1) Has anyone documented how to successfully reduce—and maintain reduced—sycophancy over multi-turn conversations with LLMs? I learn through Socratic questioning, but models seem to “interpret” this as seeking validation and become increasingly agreeable in response. When I try to correct for this (using prompts like “think critically and anticipate counterarguments,” “maintain multiple hypotheses throughout and assign probabilities to them,” and “assume I will detect sycophancy”), I’ve found models overcorrect and become excessively critical/contrarian without engaging in improved critical thinking.[1] (A minimal sketch of the kind of standing instruction I mean appears after these questions.) I understand this is an inherent problem with RLHF.
(2) Have there been any recent discussions about navigating practical life decisions under the assumption of a high p(doom)? I’ve read Eliezer’s “Death with Dignity” post and reviewed the mental health resources compiled around the alignment problem, but am interested in how others are behaviourally updating on this. It seems bunkers and excessive planning are probably futile and/or a poor use of time and effort, but are people reconsidering demanding careers? To what extent is a delayed gratification approach less compelling nowadays, and how might this impact financial management? Do people have any sort of low-cost, short-term-suffering-mitigation plans in place for the possible window between the emergence of ASI and people dying/getting uploaded/other horrors?
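For concreteness, here is a minimal sketch of what I mean by a standing anti-sycophancy instruction, pinned in the system prompt rather than repeated each turn. This is purely illustrative and assumes the Anthropic Python SDK; the model name and prompt wording are my own guesses at a setup, not an established fix for multi-turn drift:

```python
# Illustrative sketch only: pin anti-sycophancy instructions in the system prompt
# so they persist across turns, rather than restating them in each user message.
# The model alias and prompt wording below are assumptions, not a documented fix.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ANTI_SYCOPHANCY_SYSTEM = (
    "Maintain multiple hypotheses throughout the conversation and assign rough "
    "probabilities to each. Anticipate counterarguments to anything you endorse. "
    "Do not mirror my framing back as agreement; disagree plainly when the "
    "evidence warrants it, and say when you are uncertain."
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder alias; substitute whichever model is current
    max_tokens=1024,
    system=ANTI_SYCOPHANCY_SYSTEM,
    messages=[
        {"role": "user", "content": "Socratic question: what is the strongest case against my view?"},
    ],
)
print(response.content[0].text)
```

In my experience even this kind of persistent instruction tends to produce the overcorrection described above, which is partly why I am asking whether anyone has documented an approach that actually holds up over long conversations.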
I have many more questions about a broad range of topics—many of which do not concern AI—and would appreciate guidance about norms re: how many questions are appropriate per comment, whether there are better venues for specific sorts of questions, etc.
Thank you, and I look forward to engaging with the community!
[1] I find Claude’s models are more prone to sycophancy/contrarianism than Perplexity’s Sonar model, but I would prefer to use Claude as it seems superior in other important respects.