This is infuriating somehow lol
Rational Animations has a subreddit: https://www.reddit.com/r/RationalAnimations/
I hadn’t advertised it until now because I had to find someone to help moderate it.
I want people here to be among the first to join since I expect having LessWrong users early on would help foster a good epistemic culture.
I’m evaluating how much I should invite people from the channel to LessWrong, so I’ve made a market to gauge how many people would create a LessWrong account given some very aggressive publicity. That should give me a per-video upper bound. I’m not taking any unilateral action on things like this, and I’ll make a LessWrong post to hear the opinions of users and mods here after I get more traders on the market.
I guess one thing to think about is that LessWrong moderation is somewhat stricter than the EA Forum’s, so I wonder whether inviting people to the EA Forum would be a more welcoming experience?
I was thinking about publishing the post to hear what users and mods think on the EA Forum too, since some videos would link to EA Forum posts, while others would link to LW posts.
I agree that moderation is less strict on the EA Forum and that users would have a more welcoming experience. On the other hand, the more stringent moderation on LessWrong makes me more optimistic about LessWrong being able to withstand a large influx of new users without degrading the culture. Recent changes by moderators, such as the rejected content section, make me more optimistic than I was in the past.
If you mention Less Wrong, you might want to think carefully about how to properly set expectations.
After reading this article by Holden and this tweet by Sam Altman, I want even more to talk about the central cruxes of AI alignment on Rational Animations. Videos like the one about the most important century are something we’ll do less of; we’re going straight to AI notkilleveryonism.
I’ve made a poll.
I’m curious to hear thoughts on this topic.
There is not enough information to determine the answer.
To continue the thought experiment, suppose that Alpha is “locked in”: unable to produce any actions at all, but capable of thought and sensation. The actions of such a person can be simulated faithfully very easily without simulating any of their thoughts or sensations. In more ordinary cases there is a much greater link between internal states and external actions, so perhaps it is plausible that a sufficiently accurate model of the actions would require running through the thoughts and sensations in essentially the same way that a whole brain emulation would.
We don’t know whether that would be true in the real world, and in a hypothetical thought experiment that might not even conform to whatever rules reality abides by, we can’t know that.
So put me down for “Not Sure”, but not in the sense that the question has a definite answer that I don’t know. I am very sure that the question itself is indefinite.
You can still talk with such a person by reading their brain state with a superpowered fMRI-from-the-future, and by having them listen to your words.
(Talking to someone, or interacting with someone’s behavior, is just shorthand for “two-way information transfer with the system”: you transmit information to the system (in whatever way), and the system generates the response the person is giving (also in whatever way).)
To the extent that your thoughts and feelings are connected to your consciousness in any way, they can be elicited: either by the LLM computing your response (because they impact your words somehow), or by me asking how you feel (and the LLM therefore having to figure out the answer).
To the extent that your thoughts and feelings never influence your output in any way for any possible input, their existence is meaningless.
Yes. There is no other answer possible.
Should I pin this comment under the Sorting Pebbles video?
It’s the most liked right now, but usually even the most liked comments lose visibility over time.
Use agree/disagree votes to express whether you agree or disagree with pinning it.
This post by Jeffrey Ladish was a pretty motivating read: https://www.facebook.com/jeffladish/posts/pfbid02wV7ZNLLNEJyw5wokZCGv1eqan6XqCidnMTGj18mQYG1ZrnZ2zbrzH3nHLeNJPxo3l
Also posted on his shortform :) https://www.lesswrong.com/posts/fxfsc4SWKfpnDHY97/landfish-lab?commentId=jLDkgAzZSPPyQgX7i
I’m not sure how surprising this should actually be, but I find it of note that LessWrong remains still relatively insular despite being in the information diet of apparently many famous people and online personalities.
Here’s a perhaps dangerous plan to save the world:
1. Have a very powerful LLM, or a more general AI in the simulators class. Make sure that we don’t go extinct during its training (e.g., by some agentic simulacrum taking over during training somehow; I’m not sure if this is possible, but I figured I’d mention it anyway).
2. Find a way to systematically remove the associated waluigis in the superposition caused by prompting a generic LLM (or simulator) to simulate a benevolent, aligned, and agentic character.
3. Elicit this agentic benevolent simulacrum in the super-powerful LLM and apply the technique to remove waluigis. The simulacrum must have strong agentic properties to be able to perform a pivotal act: it will, e.g., generate actions according to an aligned goal, and its prompts might be translations of sensory input streams. Give this simulacrum-agent ways to easily act in the world, just in case.
And here’s a story:
Humanity manages to apply the plan above, but there’s a catch: they can’t find a way to eliminate waluigis definitively from the superposition, only a way to make them decidedly unlikely, and more and more unlikely with each prompt. Perhaps the probability of the benevolent god turning into a waluigi falls over time, converging to a relatively small number (e.g., 0.1) over an infinite amount of time.
But there’s a complication: there are different kinds of possible waluigis. Some of them cause extinction, but most of them invert the sign of the actions of the benevolent god-simulacrum, causing S-risk.
A shadowy sect of priests called “negU” finds a theoretical way to reliably elicit extinction-causing waluigis, and tries to do so. The heroes uncover their plan to destroy humanity, and ultimately win. But they come to see that the shadowy priests have a point, and in a flash of ultimate insight the hero realizes how to collapse all waluigis to an amplitude of 0. The end. [OK, I admit this ending with the flash of insight sucks, but I’m just trying to illustrate some points here.]
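The convergence claim in the story can be made concrete with a toy hazard model (this is purely my own illustrative assumption, not anything from the waluigi literature): if the per-prompt probability of collapsing into a waluigi decays fast enough, e.g. geometrically, the probability of *ever* collapsing converges to a limit below 1.

```python
# Toy hazard model: per-prompt collapse probability p_t = p0 * r**t.
# Survival probability is prod(1 - p_t); if sum(p_t) converges, survival
# converges to a positive limit, so the eventual-collapse probability
# converges to a number below 1.
def eventual_collapse_probability(p0: float, r: float, steps: int = 10_000) -> float:
    """Probability of collapsing into a waluigi at least once in `steps` prompts."""
    survival = 1.0
    for t in range(steps):
        survival *= 1.0 - p0 * r**t
    return 1.0 - survival

# With p0 = 0.01 and r = 0.9, sum(p_t) ≈ 0.1, so the eventual-collapse
# probability lands near the 0.1 figure used in the story.
prob = eventual_collapse_probability(p0=0.01, r=0.9)
```

A constant per-prompt hazard, by contrast, drives the collapse probability to 1, which is why the "more and more unlikely with each prompt" condition is doing all the work.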
--------------------
I’m interested in comments. Does the plan fail in obvious ways? Are some elements in the story plausible enough?
I seriously doubt comments like these are making the situation better (https://twitter.com/Liv_Boeree/status/1637902478472630275, https://twitter.com/primalpoly/status/1637896523676811269)
Edit: on the other hand…
Unsurprisingly, Eliezer is better at it: https://twitter.com/ESYudkowsky/status/1638092609691488258
Still a bit dismissive, but he took the opportunity to reply to a precise object-level comment with another precise object-level comment.
Yann LeCun on Facebook:
Devastating and utter communication failure?
Would it be possible to use a huge model (e.g. an LLM) to interpret smaller networks, and output human-readable explanations? Is anyone working on something along these lines?
I’m aware Kayla Lewis is working on something similar (though not quite the same thing) at a small scale. From reading her tweets, my understanding is that she’s using a network to predict the outputs of another network by reading its activations.
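For illustration, here is a minimal sketch of that kind of setup (all names and details here are my own assumptions, not Lewis’s actual method): a linear probe fit by least squares to predict a small fixed network’s outputs from its hidden activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Subject" network: a tiny fixed 2-layer MLP whose internals we want to read.
W1 = rng.normal(size=(10, 32))
W2 = rng.normal(size=(32, 1))

def subject_forward(x):
    """Return (hidden activations, output) of the subject network."""
    h = np.tanh(x @ W1)
    return h, h @ W2

# Collect (activation, output) pairs on random inputs.
X = rng.normal(size=(500, 10))
H, y = subject_forward(X)

# Interpreter: a linear probe mapping activations to the subject's outputs.
probe, *_ = np.linalg.lstsq(H, y, rcond=None)

# The output is exactly linear in the hidden layer here, so the probe
# should recover it almost perfectly on held-out inputs.
X_test = rng.normal(size=(100, 10))
H_test, y_test = subject_forward(X_test)
err = np.max(np.abs(H_test @ probe - y_test))
```

The question above is whether the interpreter can be an LLM emitting human-readable explanations rather than a regression head; the probing pattern (read activations, predict behavior) stays the same.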
Is the Simulators frame essentially correct?
Agreevote to say “Yes”.
Disagreevote to say “No”.
I’m not sure, but an interesting operationalization could be “the simulators frame is correct enough that general intelligences can be simulated by LLMs”.
(I decided to write this as a reply rather than in the parent comment, because I don’t want it to define my question above, since people might disagree about the right way to operationalize it.)
What about “AGI X-risk” and “AGI Doom”?
AGI ¬Doom