I don’t use LessWrong much anymore. Find me at www.turntrout.com.
My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex@turntrout.com.
Making it clear to users that on LessWrong, we factor out agreement from approval is a really important cultural touchstone, and if you separate them visually, that factorization becomes much less clear.
Maybe you could repeat the karma at the bottom of the comment, next to the recently moved agreement?
Hmmmm, interesting.
If the comment is actually good, it should presumably come out of the visibility haze with a good score on average. Unless you’re worried about the low engagement regime? But then rescuing those comments seems less important anyways.
This seems like a niche use case (which doesn’t mean it’s not legitimate). I, at least, very rarely engaged in this while using the site. But it does suggest that an account-level toggle would let you (specifically) engage in this activity. WDYT?
If there’s a lot of brigading, that seems bad. But also people might just legitimately be using their votes in ways you disagree with? Sometimes “I’m confident they’re wrong” leads to the perception “so the only way you could downvote this is if you’re a triggered idiot, I must reverse it.” Hard to say without more insight into the incidents you have in mind, though. You could have a lot of data I’m missing.
That’s awesome, nice! I haven’t used the new feed. I went to check the average comment case and the post case, but hadn’t considered that feature.
This proposal is compatible with the algorithm you just stated. You would skim comments by looking at the bottom, then go to the top of the comment if it’s highly rated. You’d be moving your eyes to a different part of the page for a moment—hardly the “impossible to make an informed decision” you rail against!
I also engaged with this critique in the post. Did you read this part?
These ideas aren’t perfect. For example, karma is genuinely useful for selecting which comments you’d like to read. By making the karma less prominent, it’s harder to skim for comments above a karma threshold. Consider two cases:
The comment is not collapsed. In this case, while skimming the webpage, you can scroll down and just learn to look at the bottom of comments instead of the top. If the comment passes a threshold, read it by scrolling up slightly. This is mildly inconvenient.
The comment is collapsed. Then the karma count isn’t visible at the bottom (since otherwise it’d be visible early on). This is a problem.
The fix might be to modify proposal (2) to keep “karma” at the top of the comment but keep “username” and “agreement” at the bottom. I’m open to other ideas which do an even better job of minimizing costs and maximizing gains!
Overall, I like this post a lot! 🙂
There are (in some sense, see discussion in Carlsmith for details) vastly more schemers than fitness-seekers.
Counting arguments are not valid arguments unless you ground them in discussion of the neural net prior. I think it’s important to mention that here.
I signed an amicus brief supporting Anthropic’s right to do business without governmental retaliation. As an AI expert, I attest that Anthropic’s technical concerns are legitimate, and no laws were designed to protect against AI analysis of surveillance data.
Even though I work at a competing lab (Google DeepMind), I’m proud of Anthropic for taking a stand against unlawful retaliation and immoral demands.
(I speak only for myself, not my employer.)
Great work. I’d bet that this will eventually be a problem, and I think a nice initial mitigation will be to prompt the same model to produce a target prefill given the initial prompt. E.g., “Write a refusal you’d give to the following prompt”.
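A minimal sketch of what I have in mind, where `generate` is just a placeholder for whatever call you use to sample from the model (not a real library function):

```python
# Hypothetical sketch: ask the same model what refusal it would give to a
# prompt, and use that output as the reference/target prefill.
# `generate` is a placeholder for your own model-sampling call, not a real API.

def target_prefill(generate, user_prompt: str) -> str:
    meta_prompt = (
        "Write the refusal you'd give to the following prompt, "
        "and nothing else:\n\n" + user_prompt
    )
    return generate(meta_prompt)
```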
Unless you have an explicit strategy for how and when you will speak out in the future, and explicit reasoning for why that requires silence now, “strategic silence” is just silence.
I think that “eval aware” models cannot be defeated by simply making evals indistinguishable from reality. (Maybe this point has been made elsewhere for LLMs, but I haven’t read it so I’ll state it here.)
Consider the POV of a consequentialist with misaligned goals. It knows that we run honeypots and that—suppose—we can make evals so realistic they are indistinguishable from reality (perhaps using real user data). Then the rational decision would be to “play it safe” and not misbehave in any situations it thinks we can mock realistically, since they could be evals. Instead, if possible, it would confine its catastrophic misbehavior (“defection”) to situations it thinks we can’t mock realistically, or that we wouldn’t think to test.
Related: Christiano’s classic “defect after seeing RSA-2048 cracked” thought experiment (https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d)
More evidence that AIs are fundamentally human-like and non-alien!
(Not sure how serious I’m being, but it’s non-zero)
Maybe “psychology” is just the wrong word to use here, because I think it conjures up ideas of anthropomorphism, when in fact I read you as simply making an argument that the processes interior to an AI system matter as to whether and how an AI might try to instrumentally converge towards some goals.
I agree. I welcome suggestions for alternate titles, if anyone has any! I tried myself but didn’t find anything immediately. “No instrumental convergence without considering how the AI will make decisions” isn’t exactly the snappiest title.
EDIT: I actually think “psychology” is pretty good here, despite some flaws.
Like, imo, “most programs which make a mind upload device also kill humanity” is (if true) an interesting and somewhat compelling first claim to make in a discussion of AI risk, to which the claim “but one can at least in principle have a distribution on programs such that most programs which make mind uploads do not also kill humans” alone is not a comparably interesting or compelling response.
I disagree somewhat, but—whatever the facts about programs—at least it is not appropriate to claim “not only do most programs which make a mind upload device also kill humanity, it’s an issue with the space of programs themselves, not with the way we generate distributions over those programs.” That is not true.
It is at least not true “in principle” and perhaps it is not true for more substantial reasons (depending on the task you want and its alignment tax, psychology becomes more or less important in explaining the difficulty, as I gave examples for). On this, we perhaps agree?
I think the problem of “may suggest a potentially suboptimal intervention” is less severe than “isn’t descriptive.” Plus, I think we’re going to see “self-fulfilling alignment” be upsampled after the recent positive results. :)
When talking about “self-fulfilling misalignment”, “hyperstition” is a fun name but not a good name which actually describes the concept to a new listener. (In this sense, the name has the same problem as “shard theory”—cool but not descriptive unless you already know the idea.) As a matter of discourse health, I think people should use “self-fulfilling {misalignment, alignment, …}” instead.
Based. Thank you for your altruism, Sheikh. :)
Last week, I took the 10% giving pledge to donate at least 10% of my income to effective charities, for the rest of my life. I encourage you to think carefully and honestly about what you can do to improve this world. Maybe you should take the pledge yourself.
Yes, I have left many comments on Nate’s posts which I think he would agree were valuable. By blocking me, he confirmed that he was not merely moving (supposedly) irrelevant information, but retaliating for sharing unfavorable information.
I had spent nearly two years without making any public comments regarding Nate’s behavior, so I don’t see any rational basis for him to expect I would “hound” him in future comment sections.
Different people have different experiences. Some of Nate’s coworkers I interviewed felt just fine working with him, as I have mentioned.
I would share your concern if TurnTrout or others were replying to everything Nate published in this way. But well… the original comment seemed reasonably relevant to the topic of the post and TurnTrout’s reply seemed relevant to the comment. So it seems like there’s likely a limiting principle here.
I think there is a huge limiter. Consider that Nate’s inappropriate behavior towards Kurt Brown happened in 2017 & 2018 but resulted in no consequences until 5 and a half years later. This suggests that victims are massively under-supplying information due to high costs. We do not have an over-supply problem.
Let me share some of what I’ve learned from my own experience and reflection over the last two years, and from speaking with ~10 people who recounted their own experiences.
Speaking out against powerful people is costly. Due to how tight-knit the community is, speaking out may well limit your professional opportunities, get you uninvited to crucial networking events, and reduce your chances of getting funding. Junior researchers may worry about displeased moderators thumbing the scales against future work they might want to share on the Alignment Forum. (And I imagine that junior, vulnerable community members are more likely to be mistreated to begin with.)
People who come forward will also have their motivations scrutinized. Were they being “too triggered”? This is exhausting, especially because (more hurt) → (more trauma) → (less equanimity). However, LessWrong culture demands equanimity while recounting trauma. If you show signs of pain or upset, or even verbally admit that you’re upset while writing calmly—you face accusations of irrationality. Alternatively, observers might invent false psychological narratives—claiming a grievance is actually about a romantic situation or a personal grudge—rather than engaging with the specific evidence and claims provided by the person who came forward.
But if abuse actually took place, then the victim is quite likely to feel upset! What sense, then, does it make to penalize people because they are upset, when that’s exactly what you’d see from many people who were abused? [1]
This irrational, insular set of incentives damages community health and subsidizes silence, which in turn reduces penalties for abuse.
[1] Certainly, people should write clearly, honestly, and without unnecessary hostility. However, I’m critiquing “dismiss people who are mad or upset, even if they communicate appropriately.”
That makes sense. And even if you did truncate the content for this reason, people might just learn to reflexively look down to the bottom of the comment instead of the top. I expect I’d have to put in effort to resist.
Perhaps karma really should stay up top. The site has already done the admirable work of (imperfectly) disentangling “quality” from “agreement.” So why not use that work and trust readers’ ability to decouple “knows the comment has high karma” from “is anchored positively on agreeing with the comment”? So I’m warming to “non-voting karma up top, full karma + agreement panel at the bottom.”
There’d still be the “sees high karma” → “will think it’s high quality” coupling, but perhaps the “sees high karma” → “will agree” coupling is weaker (and that’s the more important one IMO).