Rational Animations’ main writer and helmsman
Writer
In this post, I appreciated two ideas in particular:
Loss as chisel
Shard Theory
“Loss as chisel” is a reminder of how loss actually does its job, and of its implications for what AI systems may end up learning. I can’t really argue with it, and it doesn’t sound new to my ear, but it seems important to keep in mind. On its own, it justifies trying to break out of the inner/outer alignment frame. When I reason in its terms, I more easily appreciate how successful alignment could realistically involve AIs that are neither outer nor inner aligned. In practice, it may be unlikely that we get a system like that. Or it may be very likely. I simply don’t know. “Loss as chisel” just enables me to think better about the possibilities.
Shard theory, as I understand it, is instead a theory of how minds tend to be shaped. I don’t know if it’s true, but it sounds like something that has to be investigated. Some people apparently consider it a “dead end,” and I’m not sure whether it’s still an active line of research at this point; my understanding of it is limited. I’m glad I came across it, though, because on its surface it seems like a promising line of investigation. Even if it turns out to be a dead end, I expect to learn something by investigating why that is.
The post makes more claims motivating its overarching thesis that dropping the frame of outer/inner alignment would be good. I don’t know if I agree with the thesis, but it’s something that could plausibly be true, and many arguments here strike me as sensible. In particular, the three claims at the very beginning proved to be food for thought to me: “Robust grading is unnecessary,” “the loss function doesn’t have to robustly and directly reflect what you want,” “inner alignment to a grading procedure is unnecessary, very hard, and anti-natural.”
I also appreciated the post trying to make sense of inner and outer alignment in very precise terms, keeping in mind how deep learning and reinforcement learning work mechanistically.
I had an extremely brief irl conversation with Alex Turner a while before reading this post, in which he said he believed outer and inner alignment aren’t good frames. It was a response to me saying I wanted to cover inner and outer alignment on Rational Animations in depth. RA is still going to cover inner and outer alignment, but as a result of reading this post and the Training Stories system, I now think we should definitely also cover alternative frames and that I should read more about them.
I welcome corrections of any misunderstanding I may have of this post and related concepts.
This post was published shortly before Elon Musk responded to the podcast that featured Eliezer, and Eliezer also replied to Elon Musk’s response. You can find Elon Musk’s tweet at: https://twitter.com/elonmusk/status/1628086895686168613
Also, there’s a follow-up to the podcast, still featuring Eliezer, here: https://twitter.com/i/spaces/1PlJQpZogzVGE
EDIT to update: Elon Musk is no longer following Eliezer Yudkowsky: https://twitter.com/BigTechAlert/status/1628389659649736707
EDIT 2: Lex Fridman tweets “I’d love to talk to @ESYudkowsky. I think it’ll be a great conversation!” https://twitter.com/lexfridman/status/1620251244463022081
EDIT 3: Sam Altman posts a selfie with Eliezer and Grimes: https://twitter.com/sama/status/1628974165335379973
EDIT 4:
Elon Musk: “Having a bit of AI Existential angst today” https://twitter.com/elonmusk/status/1629901954234105857
Eliezer Yudkowsky replies (https://twitter.com/ESYudkowsky/status/1629932013712187395): “Remember that many things you could do to relieve your angst are actively counterproductive! Don’t give into the fallacy of “needing to do something” even if that makes things worse! Prove the prediction markets wrong about you!”
EDIT 5: From this Reuters article. Elon Musk: “I’m a little worried about the AI stuff [...] We need some kind of, like, regulatory authority or something overseeing AI development [...] make sure it’s operating in the public interest. It’s quite dangerous technology. I fear I may have done some things to accelerate it.”
EDIT 6: Eliezer: “I should probably try another podcast [...] YES FINE I’LL INQUIRE OF LEX FRIDMAN” https://twitter.com/ESYudkowsky/status/1632140761679675392
EDIT 7: Elon Musk: “In my case, I guess it would be the Luigi effect”: https://twitter.com/elonmusk/status/1632487656742420483
EDIT 8: Another exchange between Elon Musk and Eliezer: https://twitter.com/elonmusk/status/1637176761220833281
EDIT 9: Elon Musk tweets: “Maximum truth-seeking is my best guess for AI safety”: https://twitter.com/elonmusk/status/1637371603561398276
EDIT 10: Yann LeCun on Twitter:
I think that the magnitude of the AI alignment problem has been ridiculously overblown & our ability to solve it widely underestimated. I’ve been publicly called stupid before, but never as often as by the “AI is a significant existential risk” crowd. That’s OK, I’m used to it.
https://twitter.com/ylecun/status/1637883960578682883
Hofstadter too!
A guess: if LessWrong implemented this, onboarding lots of new users at once would be easier to do without ruining the culture for the people already here.
I (Gretta) will be leading the communications team at MIRI, working with Rob Bensinger, Colm Ó Riain, Nate, Eliezer, and other staff to create succinct, effective ways of explaining the extreme risks posed by smarter-than-human AI and what we think should be done about this problem.
I just sent an invite to Eliezer to Rational Animations’ private Discord server so that he can dump some thoughts on Rational Animations’ writers. It’s something we decided to do when we met at Manifest. The idea is that we could distill his infodumps into something succinct to be animated.
That said, if in the future you have from the outset some succinct and optimized material you think we could help spread to a wide audience and/or would benefit from being animated, we can likely turn your writings into animations on Rational Animations, as we already did for a few articles in The Sequences.
The same invitation extends to every AI Safety organization.
EDIT: Also, let me know if more of MIRI’s staff would like to join that server, since what you’re trying to achieve with comms seems to overlap with what we’re trying to do. That server basically serves as a central point of organization for all the work happening at Rational Animations.
Looking for someone in Japan with experience with guns in games, he looked on Twitter and found someone posting gun-reloading animations.
Having interacted with animation studios and being generally pretty embedded in this world, I know that many studios are doing similar things, such as Twitter callouts if they need some contractors fast for some projects. Even established anime studios do this. I know at least two people who got to work on Japanese anime thanks to Twitter interactions.
I hired animators through Twitter myself, using a similar process: I see someone who seems really talented → I reach out → they accept if the offer is good enough for them.
If that’s the case for animation, I’m pretty sure it often applies to video games, too.
Of all the videos we’ve done, this one elicits, by far, the strongest emotional reaction in me. Part of it is due to the essay. I found it invigorating when I first read it, but then the thought of an alien intelligence taking apart our planet, almost as inevitable as a law of physics, haunted me for a good while. Part of it is also the animation and visuals. The colors are intense, and the scenes are sometimes unsettling and violent, but still beautiful. Nature, but softened.
“April fool! It was not an April fool!”
Robin Hanson, you know nothing about Robin Hanson. You first wrote the paper in 1996 and then last updated it in 1998.
… or so says Wikipedia; that’s why I wrote 1996. I’ve just made this clear in the video description anyway. Tell me if Wikipedia got this wrong.
Btw, views have nicely snowballed from your endorsement on Twitter, so thanks a lot for it.
Note that in a later Tweet she said she was psychotic at the time.
Edit: and also in this one.
I created a market on Manifold about whether the EA Forum or LW will start using EigenKarma by 2025.
The comments under this video seem okayish to me, but maybe that’s because I’m calibrated on worse stuff under past videos, which isn’t necessarily very good news for you.
The worst I’m seeing is people grinding their own axes, which isn’t necessarily indicative of misunderstanding. But there are also regular commenters who are leaving pretty good comments:
The other comments I see range from amused and kinda joking about the topic to decent points overall. These are the top three in terms of popularity at the moment:
This recent Tweet by Sam Altman lends some more credence to this post’s take:
RA has started producing shorts. Here’s the first one using original animation and script: https://www.youtube.com/shorts/4xS3yykCIHU
The LW short-form feed seems like a good place for posting some of them.
This is infuriating somehow lol
i think about this story from time to time. it speaks to my soul.
it is cool that straight-up utopian fiction can have this effect on me.
it yanks me in a state of longing. it’s as if i lost this world a long time ago, and i’m desperately trying to regain it.
i truly wish everything will be ok :,)
thank you for this, tamsin.
Was Bing responding in Tibetan to some emojis already discussed on LW? I can’t find a previous discussion about it here. I would have expected people to find this phenomenon after the SolidGoldMagikarp post, unless it’s a new failure mode for some reason.
If you just had to pick one, go for The Goddess of Everything Else.
Here’s a short list of my favorites.
In terms of animation:
- The Goddess of Everything Else
- The Hidden Complexity of Wishes
- The Power of Intelligence
In terms of explainers:
- Humanity was born way ahead of its time. The reason is grabby aliens. [written by me]
- Everything might change forever this century (or we’ll go extinct). [mostly written by Matthew Barnett]
Also, I’ve sent the Discord invite.
I agree that MIRI’s initial replies don’t seem to address your points and seem to be straw-manning you. But there is one point they’ve made, which appears in some comments, that seems central to me. I could translate it in this way to more explicitly tie it to your post:
“Even if GPT-N can answer questions about whether outcomes are bad or good, thereby providing ‘a value function,’ that value function is still a proxy for human values, since what the system is doing is still just relaying answers that would make humans give thumbs up or thumbs down.”
To me, this seems like the strongest objection: you haven’t solved the value specification problem if your value function is still a proxy that can be Goodharted, etc. If you think about it in this way, then the specification problem gets moved to the procedure you use to fine-tune large language models to make them able to give answers about human values. If the training mechanism you use to “lift” human values out of an LLM’s predictive model is imperfect, then the answers you get won’t be good enough to build a value function we can trust.
That said, we have GPT-4 now, and with better subsequent alignment techniques, I’m not so sure we won’t be able to get an actual good value function by querying some more advanced and better-aligned language model and then using it as a training signal for something more agentic. And yeah, at that point, we still have the inner alignment part to solve, granted that we solve the value function part, and I’m not sure we should be a lot more optimistic than before having considered all these arguments. Maybe somewhat, though, yeah.
Why doesn’t it improve on AP English Literature and AP English Language?