LessWrong team member / moderator. I’ve been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I’ve been interested in improving my own epistemic standards and helping others to do so as well.
Raemon
I’m not getting your 35% to 5% reference? I just have no hope of getting as low as 5%, but a lot of hope for improving on just letting the labs take a swing.
i.e., if basically anything other than a long pause is going to be insufficient to actually work, you might as well swing for the pause.
I’m thinking of a slightly different plan than “increase the rate of people being able to think seriously about the problem”. I’d like to convince people who already understand the problem to accept that a pause is unlikely and that alignment is not known to be impossibly hard even on short timelines. …
...Getting entirely new people to understand the hard parts of the problem and then understand all of the technical skills or theoretical subtleties is another route. I haven’t thought as much about that one because I don’t have a public platform,
I think it’s useful to think of “rate of competent people thinking seriously about the right problems” as, like, the “units” of success for various flavors of plans here. There are different bottlenecks.
I currently think the rate-limiting reagent is “people who understand the problem”. And I think that’s in turn rate-limited on:
“the problem is sort of wonky and hard, with bad feedback loops, and there’s a cluster of attitudes and skills you need in order to get any traction sitting with and grokking the problem.”
“we don’t have much ability to evaluate progress on the problem, which in turn means it’s harder to provide a good funding/management infrastructure for it.”
I think “moving an overton window” is a sort of different operation than what Bengio/Hinton/Dario are doing. (Or, like, yes, they are expanding an overton window, but, their entire strategy for doing so seems predicated on a certain kind of caution/incrementalness)
I think there are two pretty different workable strategies:
say things somewhat outside the window, picking your battles
make bold claims, while believing in your convictions with enough strength that you don’t look “attackable for misspeaking”.
Going halfway from one to the other doesn’t actually work, and the second one doesn’t really work unless you actually do have those convictions. There are a few people trying to do the latter, but, most of them just don’t actually have the reputation that’d make anyone care (and also there’s a lot of skill to doing it right). I think if at least one of Yoshua/Geoffrey/Dario/Demis switched strategies it’d make a big difference.
Do you mean like a short pithy name for the fallacy/failure-mode?
That’s why I want to convince more people that actually understand the problem to identify and work like mad on the hard parts like the world is on fire, instead of hoping it somehow isn’t or can be put out.
FYI something similar to this was basically my “last year’s plan”, and it’s on hold because I think it is plausible right now to meaningfully move the overton window around pauses or at least dramatic slowdowns. (This is based on seeing the amount of traffic AI 2027 got, and the number of NatSec endorsements that If Anyone Builds It got, and having recently gotten to read it and thinking it is pretty good.)
I think if Yoshua Bengio, Geoffrey Hinton, or Dario actually really tried to move overton windows instead of sort of trying to maneuver within the current one, it’d make a huge difference. (I don’t think this means it’s necessarily tractable for most people to help. It’s a high-skill operation)
(Another reason for me putting “increase the rate of people able to think seriously about the problem” on hold is that my plans there weren’t getting that much traction. I have some models of what I’d try next when/if I return to it but it wasn’t a slam dunk to keep going)
I assume a lot of us aren’t very engaged with pause efforts or hopes because it seems more productive and realistic to work on reducing misalignment risk from ~70% toward ~35%.
Nod. I just, like, don’t think that’s actually that great a strategy – it presupposes it is actually easier to get from 70% to 35% than from 35% to 5%. I don’t see Anthropic-et-al actually really getting ready to ask the sort of questions that would IMO be necessary to actually do-the-reducing.
I do realize point 2 is not the way LW is intended to operate
Well, I would say the whole reason LW mods are banning Said is that we do, in fact, want LW to operate this way (or, directionally similar to this). I do also want wrong ideas to get noticed and discarded, and I do want “good taste in generating ideas” (there are people who aren’t skilled enough at casual idea generation for me to feel excited about them generating such conversation on LW). But I think that kind of casual idea generation is an essential part of any real generative intellectual tradition.
you could simply ignore or stop replying to him if you thought his style of conversation was too extreme for your tastes, instead of feeling like his “entrance to my comment threads was a minor emergency”.
I wanna flag, your use of the word “simply” here is… like, idk, false.
I do think it’s good for people to learn the skill of not caring what other people think and being able to think out loud even when someone is being annoying. But this is a pretty difficult skill for lots of people. I think it’s pretty common for people who are attempting to learn it to instead end up contorting their original thought process around the anticipated social punishment.
I think it’s a coherent position to want LessWrong’s “price of entry” to be gaining that skill. I don’t think it’s a reasonable position to call it “simply...”. It’s asking for like 10-200 hours of pretty scary, painful work.
Simulating the *rest* of the political disagreement
This seems like it’s engaging with the question of “what do critics think?” in a sort of model-free, uninformed, “who to defer to” sort of way.
For a while, I didn’t fully update on arguments for AI Risk being a Big Deal because the arguments were kinda complex and I could imagine clever arguers convincing me of it without it being true. One of the things that updated me over the course of 4 years was actually reading the replies (including by people like Hanson) and thinking “man, they didn’t seem to even understand or address the main points.”
i.e. it’s not that they didn’t engage with the arguments, it’s that they engaged with the arguments badly which lowered my credence on taking their opinion seriously.
(I think nowadays I have seen some critics who do seem to me to have engaged with most of the real points. None of their counterarguments seem like they’ve added up to “AI is not a huge fucking deal that is extremely risky” in a way that makes any sense to me, but, some of them add up to alternate frames of looking at the problem that might shift what the best thing(s) to do about it are)
Periodically I’ve considered writing a post similar to this. A piece that I think this doesn’t fully dive into is “did Anthropic have a commitment not to push the capability frontier?”.
I had once written a doc aimed at Anthropic employees, during the SB 1047 era, when I felt like Anthropic was advocating for changes to the law that were hard to interpret un-cynically.[1] I’ve had a vague intention to rewrite this into a more public-facing thing, but, for now I’m just going to lift out the section talking about the “pushing the capability frontier” thing.
When I chatted with several Anthropic employees at the happy hour a couple months to a year ago, at some point I brought up the “Dustin Moskovitz’s earnest belief was that Anthropic had an explicit policy of not advancing the AI frontier” thing. Some employees have said something like “that was never an explicit commitment. It might have been a thing we were generally trying to do a couple years ago, but that was more like ‘our de facto strategic priorities at the time’, not ‘an explicit policy or commitment.’” When I brought it up, the vibe in the discussion-circle was “yeah, that is kinda weird, I don’t know what happened there”, and then the conversation moved on.
I regret that. This is an extremely big deal. I’m disappointed in the other Anthropic folk for shrugging and moving on, and disappointed in myself for letting it happen.
First, recapping the Dustin Moskovitz quote (which, FYI, I saw personally before it was taken down)
Second, gwern also claims he talked to Dario and came away with this impression:
> Well, if Dustin sees no problem in talking about it, and it’s become a major policy concern, then I guess I should disclose that I spent a while talking with Dario back in late October 2022 (ie. pre-RSP in Sept 2023), and we discussed Anthropic’s scaling policy at some length, and I too came away with the same impression everyone else seems to have: that Anthropic’s AI-arms-race policy was to invest heavily in scaling, creating models at or pushing the frontier to do safety research on, but that they would only release access to second-best models & would not ratchet capabilities up, and it would wait for someone else to do so before catching up. So it would not contribute to races but not fall behind and become irrelevant/noncompetitive.
> And Anthropic’s release of Claude-1 and Claude-2 always seemed to match that policy—even if Claude-2 had a larger context window for a long time than any other decent available model, Claude-2 was still substantially weaker than ChatGPT-4. (Recall that the causus belli for Sam Altman trying to fire Helen Toner from the OA board was a passing reference in a co-authored paper to Anthropic not pushing the frontier like OA did.)
I get that y’all have more bits of information than me about what Dario is like. But, some major hypotheses you need to be considering here are a spectrum between:
Dustin Moskovitz and Gwern both interpreted Dario’s claims as more like commitments than Dario meant, and a reasonable bystander would attribute this more to Dustin/Gwern reading too much into it.
Dario communicated poorly, in a way that was maybe understandable, but predictably would leave many people confused.
Dario in fact changed his mind explicitly (making this more like a broken commitment, and subsequent claims that it was not a broken commitment more like lies)
Dario deliberately phrased things in an open-ended/confusing way, optimized to be reassuring to a major stakeholder without actually making the commitments that would have backed up that reassurance.
Dario straight up lied to both of them.
Dario is lying to/confusing himself.
This is important because:
a) even option 2 seems pretty bad given the stakes. I might cut many people slack for communicating poorly by accident, but when someone is raising huge amounts of money, building technology that is likely to be very dangerous by default, accidentally misleading a key stakeholder is not something you can just shrug off.
b) if we’re in worlds with options 3, 4, 5, or 6 (and, really, even option 2), you should be more skeptical of other reassuring things Dario has said. It’s not that important to distinguish between these, because the question isn’t “how good a person is Dario?”, it’s “how should you interpret and trust things Dario says?”
In my last chat with Anthropic employees, people talked about meetings and slack channels where people asked probing, important questions, and Dario didn’t shy away from actually answering them, in a way that felt compelling. But, if Dario is skilled at saying things to smart people with major leverage over him that sound reassuring, but leave them with a false impression, you need to be a lot more skeptical of your-sense-of-having-been-reassured.
1. ^ In particular, advocating for removing the whistleblower clause, and simultaneously arguing that “we don’t know how to make a good SSP yet, which is why there shouldn’t yet be regulations about how to do it” while also arguing “companies’ liability for catastrophic harms should be dependent on how good their SSP was.”
(Another mod leaned in the other direction, and I do think, like, this is pretty factual and timeless, and Dario is more of a public figure than an inside-baseball LessWrong community member, so it seemed okay to err in the other direction. But still flagging it as an edge case for people trying to intuit the rules.)
I was unsure about it; the criteria for frontpage are “Timeless” (which I agree this qualifies as) and “not inside-baseball-y” (often with vaguely political undertones), which seemed less obvious. My decision at the time was “strong upvote but personal blog”, but I think it’s not obvious and I could see another LW mod calling it differently. I agree it’s a bunch of good information to have in one place.
This seems basically a duplicate of Task (AI Goal)
I mean there is ~no prior art here because humanity just invented LLMs last ~tuesday.
Okay, j/k, there may be some. But, I think you’re imagining “the LLM is judging whether the content is good” as opposed to “the LLM is given formulaic rules to evaluate posts for, and it returns ‘yes/no/maybe’ for each of those evaluations.”
The question here is more “is it possible to construct rules that are useful?”
(in the conversation that generated this idea, one person noted “on my youtube channel, it’d be pretty great if I could just identify any comment that mentions someone’s appearance and have it automoderated as ‘off topic’”. If we were trying this on a LessWrong-like community, the rules I might want to try to implement would probably be subtler and I don’t know if LLMs could actually pull them off).
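For concreteness, here’s a minimal sketch of what the “formulaic rules, yes/no/maybe per rule” version could look like (as opposed to “the LLM judges whether the content is good”). This is just an illustrative sketch: it assumes the OpenAI Python client, and the model name and the example rules are placeholders, not anything we’ve actually built or tested.

```python
# Minimal sketch of rule-based LLM auto-moderation: each rule is a fixed,
# human-written check, and the LLM only returns "yes"/"no"/"maybe" per rule.
# Assumes the OpenAI Python client; the model name and rules are placeholders.
from openai import OpenAI

client = OpenAI()

RULES = [
    "Does this comment mention anyone's physical appearance?",
    "Does this comment contain a personal insult aimed at another user?",
]


def check_comment(comment: str) -> dict[str, str]:
    """Return {rule: 'yes' | 'no' | 'maybe'} for each moderation rule."""
    verdicts = {}
    for rule in RULES:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Answer with exactly one word: yes, no, or maybe."},
                {"role": "user",
                 "content": f"Rule: {rule}\n\nComment:\n{comment}"},
            ],
        )
        answer = response.choices[0].message.content.strip().lower()
        verdicts[rule] = answer if answer in ("yes", "no", "maybe") else "maybe"
    return verdicts


# A "yes" on, say, the appearance rule could auto-tag the comment "off topic"
# and queue it for human review, rather than removing it outright.
```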
I’m not sure if you understood this or not, but, I think Alex meant “nutrient” in a metaphorical way rather than a literal dietary way.
The basic framework here seems plausible but it’d be easier to say something useful if you gave more specific worked examples from your life.
So you now need to pay before you post?
and comments are disabled when you’re out of funds? natural consequence but lol.
There’s a few ways you could do it. It occurs to me now it could actually be the commenter’s job to pay via microtransactions, and maybe the author can tip back if they like it, Flattr-ish. This also maybe solves the rate-limits thing.
You could also just set it to “when you run out of money, everyone can comment without restriction.”
You could also have, like, everyone just pays a monthly subscription to participate. I think the above ideas are kinda cute tho.
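To make the flow concrete, here’s a toy sketch of the “commenter pays per comment, author can tip back” version; every name, price, and balance in it is a made-up illustration, not a proposal for real payment infrastructure.

```python
# Toy model of "commenters pay per comment, authors can tip back" (Flattr-ish).
# Everything here is a hypothetical illustration, not a real payment system.
from dataclasses import dataclass, field

COMMENT_PRICE = 0.10  # hypothetical per-comment price, in dollars


@dataclass
class Account:
    name: str
    balance: float = 0.0


@dataclass
class Thread:
    author: Account
    comments: list = field(default_factory=list)

    def post_comment(self, commenter: Account, text: str) -> bool:
        # Commenter pays to post; an empty balance disables further comments,
        # which doubles as a soft rate limit.
        if commenter.balance < COMMENT_PRICE:
            return False
        commenter.balance -= COMMENT_PRICE
        self.author.balance += COMMENT_PRICE
        self.comments.append((commenter, text))
        return True

    def tip_back(self, comment_index: int, amount: float) -> None:
        # Flattr-ish: the author returns some value to a comment they liked.
        commenter, _ = self.comments[comment_index]
        if self.author.balance >= amount:
            self.author.balance -= amount
            commenter.balance += amount
```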
privacy concerns
I was imagining this for the public-ish internet, where I’d expect it to be digested for the next round of LLM training anyway.
Curated. I’m not quite sure what to make of this point, but I am surprised I hadn’t heard it suggested before.
I’m… not really sure what I expect but I am curious for someone to go make some stone tools and let us know, uh, was it good for you?
FYI I currently would mainline guess that this is true. Also I don’t get why current evidence says anything about it – current AIs aren’t dangerous, but that doesn’t really say anything about whether an AI that’s capable of speeding up superalignment or pivotal-act-relevant research by even 2x would be dangerous.