Thane Ruthenis comments on Caleb Biddulph’s Shortform

Thane Ruthenis 28 May 2025 7:39 UTC
38 points
15
It’s not as though avoiding saying the word “sycophancy” would make ChatGPT any less sycophantic
… Are we sure about this? LLMs do be weird. Stuff is heavily entangled within them, such that, e.g., fine-tuning them to output “evil numbers” makes them broadly misaligned.
Maybe this is a side effect of some sort of feature-downweighting technique à la Golden Gate Claude, where biasing the model towards less sycophancy also makes it unable to say “sycophancy”.
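
To make the hypothesis concrete, here is a toy sketch of how it could happen. It is not a claim about how ChatGPT or Claude is actually trained or steered. Everything in it is an assumption made for illustration: the 3-dimensional "residual stream", the `SYCOPHANCY_FEATURE` direction, the `UNEMBED` rows, and the steering strength are all invented. The only point is that if the token “sycophancy” is read out along the same direction that drives the behavior, pushing that direction down also pushes the token's logit down.

```python
# Toy illustration only: a hypothetical 3-dimensional hidden state.
# All vectors, token names, and the steering strength are made up.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, strength):
    """Shift the hidden state along a feature direction (negative = downweight)."""
    return [h + strength * d for h, d in zip(hidden, direction)]

# Hypothetical "sycophancy" feature direction.
SYCOPHANCY_FEATURE = [1.0, 0.0, 0.0]

# Hypothetical unembedding rows. By assumption, the token "sycophancy"
# is read out mostly along the same direction as the behavior itself.
UNEMBED = {
    "sycophancy": [0.9, 0.1, 0.0],
    "honesty": [-0.2, 0.8, 0.1],
    "bridge": [0.0, 0.0, 1.0],
}

def logits(hidden):
    """Token logits as dot products of unembedding rows with the hidden state."""
    return {tok: dot(row, hidden) for tok, row in UNEMBED.items()}

hidden = [1.0, 0.5, 0.2]
before = logits(hidden)
after = logits(steer(hidden, SYCOPHANCY_FEATURE, strength=-3.0))

for tok in UNEMBED:
    print(f"{tok:>10}: {before[tok]:+.2f} -> {after[tok]:+.2f}")
```

In this toy setup the “sycophancy” token goes from the highest logit to the lowest once the feature is downweighted, while a token orthogonal to the feature (“bridge”) is unaffected. Whether real models entangle the behavior and the word this tightly is exactly the open question.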