Theoretical AI alignment (and relevant upskilling) in my free time. My current view of the field is here (part 1) and here (part 2).
NicholasKross
This is pretty good, although you ought to actually link more of the citations for specific facts. Court documents, testimonies, company docs, even Wikipedia would’ve been helpful for the part about IAG.
(Semi-dumb LW category suggestion: Posts That Could Have Made You Good Money In Hindsight)
OpenAI should have an internal-only directory allowing employees and leadership to write up and see each other’s beliefs about AI extinction risk and alignment approaches.
FWIW: I volunteered for Nonlinear in summer 2021, and the people behind it are pretty on-top-of-things!
I’ve noticed a thing happening (more? lately? just in my reading sample?) similar to what you describe, where the emphasis goes more onto the social/community side of rationality as opposed to… the rest of rationality.
The Moral Mazes examples are related to that. Also topics like reputation, and virtues ‘n’ norms, and what other people think of you.
At some point, a person’s energy and resources are finite. They can try to win at anything, but maybe the lesson from recent writings is “winning at social anything is hard enough (for a LW-frequenting personality) to be a notable problem”.
Some thoughts on this issue:
- Codify, codify, codify. Most people in the LW community are lacking in some social skills (relative to both non-members and the professional-politician standard). Those who have those skills: please make long detailed checklists and email-extensions of what works. That way, the less-socially-skilled among us can avoid losing-at-social without turning into Mad-Eye Moody and losing our energy.
- Is there a trend where communities beat around the bush more over time? Many posts do what I’ve heard called “subtweeting”: “Imagine a person X, having to do thing Y, and problem Z happens...”. Yes, social game theory exists and reputation exists, but at least consider just telling people the details.
Common/game-theory/vague/bad: “Let’s say somebody goes to $ORG, but they do something bad. We should consider $ORG and everyone there to be infected with The Stinky.”
Better/precise/detailed/good: “Hey, Nicholas Kross went to MIRI and schemed to build a robot that outputs anti-utils. How do we prevent this in the future, and can we make a preventative checklist?” [1]
If you are totally financially/legally dependent on an abusive organization or person, obviously writing a call-out post with details is game-theoretically bad for you. In that case, don’t leave in those details. For everyone else: either write a postmortem or say “I’m under NDA, but...”.
We get it, we need Slack, and society doesn’t give enough of it for our purposes. Can somebody with higher dopamine coordinate or promote any method so we can set up living arrangements that escape mainstream social pressures?
(If your AGI-will-give-us-Slack timeline is shorter than a community-Slack-project, how much should you really worry about long-term politics-style social/reputational-game-theoretic threats to the community’s Slack?)
Interested in more thoughts on this.
[1] This is a fictional example. Plus, it’s not even slyly alluding to any situations! (Well, as far as I know.)
This makes me wonder if operations really is the big bottleneck with EA/AIS/MIRI in particular. Anyone good at operations would not have let this situation happen. Either it would’ve ended quicker (“Sorry, we’re not hiring right now, try again later.”) or ended differently (“You’re hired.”).
Good catch on the natural-vs-man-made accidental bait-and-switch in the common argument. This post changed my mind to think that, at least for scaling-heavy AI (and, uh, any disaster that leaves the government standing), regulation could totally help the overall situation.
Much cheaper, though still hokey, ideas that you should have already thought of at some point:
- A “formalization office” that checks and formalizes results by alignment researchers. It should not take months for a John Wentworth result to get formalized by someone else.
- Alignment-specific outreach at campuses/conventions with top cybersecurity people.
1. Person tries to work on AI alignment.
2. Person fails due to various factors.
3. Person gives up working on AI alignment. (This is probably a good move, when it’s not your fit, as is your case.)
4. Danger zone: In ways that sort-of-rationalize-around their existing decision to give up working on AI alignment, the person starts renovating their belief system around what feels helpful to their mental health. (I don’t know if people are usually doing this after having already tried standard medical-type treatments, or instead of trying those treatments.)
5. Danger zone: Person announces this shift to others, in a way that’s maybe and/or implicitly prescriptive (example).
There are, depressingly, many such cases of this pattern. (Related post with more details on this pattern.)
This seems… testable? Like, it’s kind of the opposite message of Yudkowsky’s “try harder” posts.
Have two groups work on a research problem. One is in doom mode, one is in sober mode. See which group makes more progress.
Almost upvoted for kiiiiinda describing the actual mental model. Not upvoted because:
- Still tried to be cool and edgy with the writing style, when we’ve already established that this is a dumb idea for this topic. No, I’m not moving to the woods with you to get the “real version”.
- No illustrations or diagrams, when talking about something extremely meta-level and slippery-language-filled.
WaitButWhy does this genre better (describing psychology helpfully using very fake analogies).
Don’t forget Orthogonal’s mathematical alignment research, including QACI!
Possibly relevant: Weak Men are Superweapons.
This definitely describes my experience, and gave me a bit of help in correcting course, so thank you.
Also, I recall an Aella tweet where she claimed that some mental/emotional problems might be normal reactions to having low status and/or not doing much interesting in life. Partly since, in her own experience, those problems were mostly(?) alleviated when she started “doing more awesome stuff”.
Even after thinking through these issues in SERI MATS, and already agreeing with at least most of this post, I was surprised, upon reading it, by how many new-or-newish-to-me ideas and links it contained.
I’m not sure if that’s more of a failure on my part, or a failure of the alignment field to notice “things that are common between a diverse array of problems faced”. Kind of related to my hunch that multiple alignment concepts (“goals”, “boundaries”, “optimization”) will turn out to be isomorphic to the same tiny handful of mathematical objects.
Now I wonder if there should be some (public? private?) thing kinda like MyAnimeList, but for papers. So you can track which ones you’ve “dropped” vs kept-reading, summarize (“review”) the takeaways/quality, etc.
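To make the idea concrete, here’s a minimal sketch of what the underlying per-paper record might look like, as a hypothetical Python snippet; the names (`PaperEntry`, `Status`) and the placeholder titles/URLs are made up for illustration, not an existing tool:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PLAN_TO_READ = "plan-to-read"
    READING = "reading"
    FINISHED = "finished"
    DROPPED = "dropped"


@dataclass
class PaperEntry:
    title: str
    url: str
    status: Status = Status.PLAN_TO_READ
    score: Optional[int] = None  # personal 1-10 rating, like an anime-list score
    review: str = ""             # short takeaway / quality note


# A personal "list" is then just a collection of entries you can filter and share.
my_list = [
    PaperEntry("Some agenda paper", "https://example.org/a",
               status=Status.FINISHED, score=8, review="Clear framing; worth rereading."),
    PaperEntry("Some dense paper", "https://example.org/b",
               status=Status.DROPPED, review="Dropped after section 2; notation too heavy."),
]

dropped = [p.title for p in my_list if p.status is Status.DROPPED]
print(dropped)  # -> ['Some dense paper']
```

Most of the value of a real version would be the social layer on top (public profiles, shared reviews, aggregate scores); the per-paper record itself is roughly this small.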
Eagerly awaiting the Massive LessWrong Post Explaining And Comparing Capitalism And Socialism With Lots Of Steelmanned Arguments, General Principles, And Consequentialism. It’d take a lot of work to write well, though...
There is something weirdly powerful about these being in well-printed form, more approachable in both form and content. This is something easier to recommend to friends without having to go “By the way, this post references a thing that is no longer relevant / is poorly researched”, like the Robbers Cave experiment mentioned.
Extremely strong upvote for Oliver’s 2nd message.
Also, not as related: kudos for actually materially changing the course of your organization, something which is hard for most organizations, period.