Quinn
This comment’s updates for me personally:
The overall “EA is scary / criticizing leaders is scary” meme is something I very frequently roll my eyes at; I find it alien and sometimes laughable when people say they’re worried about being bold and brave, cuz all I ever see are people being rewarded for constructive criticism. But man, if I didn’t know about some of this stuff, then I was missing a huge piece of the puzzle. It’s unclear yet what I’ll think about, say, the anon meta on forums after this comment sinks in / propagates, but my guess is it’ll be very different from what I thought before.
People are way too quick to reward themselves for trying (this update is in my priority queue for a proper writeup): Nate & enablers saying that productivity / irreplaceability is an excuse to triage out fundamental interpersonal effort is equivalent (as far as I’m concerned) to a 2022 University Community Builder (TM) deciding that they’re entitled to opulent retreats the moment they declare stated interest in saving the world. “For the greater good” thinking is fraught and dicey even when someone is definitely valuable enough for the case to genuinely be made, and there’s obvious pressure toward accepting a huge error rate if you simply want to believe you or a colleague is that productive/insightful. I honestly think Nate’s position here is more excusable than the enablers’: you basically need to see Nobel-physicist-level output before you even consider giving someone this much benefit of the doubt, and even then you should decide not to after considering it. I’m kinda dumbfounded that it was this easy for MIRI’s culture to be like this. (Yes, my epistemic position on the stakes is going to be wrong because of “undisclosed by default”, but there are a bajillion sources for my roll to disbelieve if anyone says “well actually, undisclosed MIRI codebases are Nobel-physicist-level”.)
I feel very vindicated having written this comment, and I am subtracting karma from everyone who gave Nate points for writing a long introspective gdoc. You guys should’ve assumed that it would be a steep misfire.
Someone told me that some friends of theirs hated a talk or office hours with Nate, and I super devil’s-advocated the idea that lots of people have reasons, which I’m not sympathetic with, for disliking the “blunt because if I suffer fools we’ll all lower our standards” style: I now need to apologize to them for being dismissive. I mean, for chrissakes y’all, in my first 1:1 with Eliezer he was not suffering fools; he helped me speedrun noticing how misled my optimism about my then-current project was, and it was jovial and pleasant, so I felt like an idiot and I look back fondly on the interaction. So no, the defense Nate retreats to elsewhere in these comments (that his comms style is downstream of trying to outperform prestigious, etiquette-bound academics goodharting on useless but legible research) does not hold.
Extremely warm, from-the-heart comments about Nate from my PoV (not coming from a phoned-in/trite/etiquette “soften the blow” place, but I’m very glad there’s that upside):
I’m a huge replacing guilt fan
reading Nate on GitHub and LessWrong has been very important to me in my CS education. The old intelligence.org/research-guide mattered so much to me at a very important life/development pivot.
Nate’s strategy / philosophy of alignment posts, particularly recently, have been phenomenal.
in a sibling comment, Nate wrote:
If you stay and try to express yourself despite experiencing strong feelings of frustration, you’re “almost yelling”. If you leave because you’re feeling a bunch of frustration and people say they don’t like talking to you while you’re feeling a bunch of frustration, you’re “storming out”.
This is hard and unfair and I absolutely feel for him, I’ve been there[1].
I don’t know if we’ve ever been in the same room. I’m going off of web presence, and a few comments or rumors others have passed along.
[1] on second thought: I’ve mostly only been there in, say, a soup kitchen run by trans commie lesbians who are eagerly looking for the first excuse they can find to cancel the cis guy. I guess I don’t at all relate to the possibility that someone would feel that way in the bay area tech scene.
For me the scary part was Meta’s willingness to do things that are minimally/arguably torment-nexusy and then put it in PR language like “cooperation” and actually with a straight face sweep the deceptive capability under the rug.
This is different from believing that the deceptive capability in question is on its own dangerous or surprising.
My update from cicero is almost entirely on the social reality level: I now more strongly than before believe that in the social reality, rationalization for torment-nexus-ing will be extremely viable and accessible to careless actors.
(that said, I think I may have forecasted 30-45% chance of full-press diplomacy success if you had asked me a few weeks ago, so maybe I’m not that unsurprised on the technical level)
Cheers, thanks for writing. I was a very anti-math high school student; I almost got expelled for throwing a temper tantrum at my algebra 2 teacher cuz I thought it wasn’t fair they were making me sit through it. That was 10th grade, and they didn’t make me take any other math courses. 7 or 8 years later I took the placement exam at a community college, placed into precalc I, retreated to Khan Academy, retook the exam a few months later, and placed into calc I; I took that and discrete and all of their sequels, and ended up getting straight As and tutoring every single math course there. I feel like I owe a lot to Khan Academy: a tailored user experience really adds a ton of value over throwing myself at textbooks (and I did eventually figure out how to throw myself at textbooks, but also failed at doing so many times).
The purpose of my comment is to register, for anyone intimidated by comments implying that people in the movement were doing math at such-and-such level in grade school, that we’re out here, we exist, and we’re doing stuff; we who had to put effort into precalc in our 20s.
To be fair, me and someone tried a “select on building properties, try to cultivate entertaining properties later” strategy.
We read each other’s dating docs, calculated an optimism-inducing amount of goal alignment and compatibility, and took a crack at being charming and funny and sexy to each other.
It did not go as planned. I was a little shocked—surely my monkeybrain needs would cooperate with (or be coerced into aligning with) my actual life goals, right? With hindsight I’m kind of honing my ability to recognize “just reverse engineer the ‘spark’, how hard can it be” as a special kind of stupid.
I heard a pretty haunting take about how long it took to discover steroids in bike races. Apparently, there was a while where a “few bad apples” narrative remained popular even when an ostensibly “one of the good ones” guy was outperforming guys discovered to be using steroids.
I’m not sure how dire or cynical we should be about academic knowledge or incentives. It’s more or less defensible to assume that no one with a successful career is doing anything real until proven otherwise, but that’s still a very extreme view that I’d personally bet against. Of course, things also vary a lot field by field.
how do we know it’s false?
https://www.lesswrong.com/posts/BGLu3iCGjjcSaeeBG/related-discussion-from-thomas-kwa-s-miri-research?commentId=fPz6jxjybp4Zmn2CK This brief subthread can be read as “giving Nate points for trying”, and it’s too credulous about whether “introspection” actually works. My wild background guess is that roughly 60% of the time “introspection” is more “elaborate self-delusion” than working as intended, and there are times when someone saying “no, but I’m trying really hard to be good at it” drives that probability up instead of down. I didn’t think this was one of those times before reading Kurt’s comment. A more charitable view is that this prickliness (understatement) is something that’s getting triaged out / deprioritized, not gymnastically dodged, but I think it’s unreasonable to ask people to pay attention to the difference.
That’s beside the point: the “it” was just the gdoc. “It would be a steep misfire” would mean “the gdoc tries to talk about the situation and totally fails to address what matters”. The subtraction of karma was metaphorical (I don’t think I’ve even officially voted on LessWrong!). I want to emphasize that I’m still holding this very weakly, cuz for instance I can expect people in that subthread to later tell me a detailed inside view about how giving Nate points for trying (by writing that doc) doesn’t literally mean they were drawn into this “if von Neumann has to scream at me to be productive, then it would be selfish to set a personal boundary” take. But I think it’s reasonable for me to be suspicious and cautious, and to look for more evidence that people would not fall for this class of “holding some people to different standards for for-the-greater-good reasons” again.
Ah yeah. I’m a bit of a believer in “introspection preys upon those smart enough to think they can do it well but not smart enough to know they’ll be bad at it”[1], at least to a partial degree. So it wouldn’t shock me if a long document failed to capture what matters.
[1] epistemic status: in that sweet spot myself
Ideally there would be an exceedingly high bar for strategic withholding of worldviews. I’d love some mechanism for sending downvotes to the orgs that vetoed their staff’s participation! I’d love some way of socially pressuring these orgs into at least trying to convince us that they had really good reasons.
I’m pretty cynical: I assume nervous and uncalibrated shuffling of HR or legal counsel is more likely than actual defense against hazardous leakage of, say, capabilities hints.