james.lucassen
My attempt to break down the key claims here:
The internet is causing rapid memetic evolution towards ideas which stick in people’s minds and encourage them to take certain actions, especially ones that spread the idea. Ex: wokism, Communism, QAnon, etc.
These memes push people who host them (all of us, to be clear) towards behaviors which are not in the best interests of humanity, because Orthogonality Thesis
The lack of will to work on AI risk comes from these memes’ general interference with clarity/agency, plus selective pressure to develop ways to get past “immune” systems which allow clarity/agency
Before you can work effectively on AI stuff, you have to clear out the misaligned memes stuck in your head. This can get you the clarity/agency necessary, and make sure that (if successful) you actually produce AGI aligned with “you”, not some meme
The global scale is too big for individuals—we need memes to coordinate us. This is why we shouldn’t try and just solve x-risk, we should focus on rationality, cultivating our internal meme garden, and favoring memes which will push the world in the direction we want it to go
When you have a self-image as a productive, hardworking person, the usual Marshmallow Test gets kind of reversed. Normally, there’s some unpleasant task you have to do which is beneficial in the long run. But in the Reverse Marshmallow Test, forcing yourself to work too hard makes you feel Good and Virtuous in the short run but leads to burnout in the long run. I think conceptualizing it this way has been helpful for me.
Optimization and Adequacy in Five Bullets
Moravec’s Paradox Comes From The Availability Heuristic
Strategy For Conditioning Generative Models
Attempts at Forwarding Speed Priors
Evaluating Stability of Unreflective Alignment
I think a lot of this discussion becomes clearer if we taboo “intelligence” as something like “ability to search and select a high-ranked option from a large pool of strategies”.
Agree that the rate-limiting step for a superhuman intelligence trying to affect the world will probably be stuff that does not scale very well with intelligence, like large-scale transport, construction, smelting widgets, etc. However, I’m not sure it would be so severe a limitation as to produce situations like what you describe, where a superhuman intelligence sits around for a month waiting for more niobium. The more strategies you are able to search over, the more likely it is that you’ll hit on a faster way of getting niobium.
Agree that being able to maneuver in human society and simulate/manipulate humans socially would probably be much more difficult for a non-human intelligence than some other tasks humans might think of as equally difficult, since humans have a bunch of special-purpose mechanisms for that kind of thing. That being said, I’m not convinced it is so hard as to be practically impossible for any non-human to do. The amount of search power it took evolution to find those abilities isn’t so staggering that it could never be matched.
I’m pretty surprised by the position that “intelligence is [not] incredibly useful for, well, anything”. This seems much more extreme than the position that “intelligence won’t solve literally everything”, and like it requires an alternative explanation of the success of homo sapiens.
Thank you for posting this! There’s a lot of stuff I’m not mentioning because confirming agreements all the time makes for a lot of comment clutter, but there’s plenty of stuff to chew on here. In particular, the historical rate of scientific progress seems like a real puzzle that requires some explanation.
Putting this in a separate comment, because Reign of Terror moderation scares me and I want to compartmentalize. I am still unclear about the following things:
Why do we think memetic evolution will produce complex/powerful results? It seems like the mutation rate is much, much higher than biological evolution.
Valentine describes these memes as superintelligences, as “noticing” things, and generally being agents. Are these superintelligences hosted per-instance-of-meme, with many stuffed into each human? Or is something like “QAnon” kind of a distributed intelligence, doing its “thinking” through social interactions? Both of these models seem to have some problems (power/speed), so maybe something else?
Misaligned (digital) AGI doesn’t seem like it’ll be a manifestation of some existing meme and therefore misaligned, it seems more like it’ll just be some new misaligned agent. There is no highly viral meme going around right now about producing tons of paperclips.
This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?
“Just Retarget the Search” directly eliminates the inner alignment problem.
I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you’re willing to assume that our interpretability tools are so good they can’t ever be tricked, you have to deal with that.
It’s not necessarily a huge issue—hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade our interpretability tools, but it’s not just “bada-bing bada-boom” exactly.
In Search of Strategic Clarity
Ok this is going to be messy but let me try to convey my hunch for why randomization doesn’t seem very useful.
- Say I have an intervention that’s helpful, and has a baseline 1⁄4 probability. If I condition on this statement, I get 1 “unit of helpfulness”, and a 4x update towards manipulative AGI.
- Now let’s say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1⁄4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutually exclusive. Which means that if I’d done no simulation at all, I would’ve gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
- Ok, well, what if my four interventions have 1⁄8 baseline probability each, so only 50% total? Then I pick one at random, p(O | natural) = 1⁄8, p(O | manipulative) = 1⁄4, so I get a 2x update towards manipulative AGI. This is the same as if I’d just conditioned on the statement “one of my four interventions happens”, and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update.
Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn’t doing anything different from just using a weaker search condition—it gives up bits of search, and so it has to pay less.
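To make the arithmetic in the bullets above concrete, here is a minimal sketch. It assumes (my reading, not something stated explicitly) that a manipulative AGI produces the conditioned-on outcome with certainty, and that the randomization is uniform over the interventions. The function names are made up for illustration.

```python
from fractions import Fraction


def single_condition_update(baseline):
    """Update towards manipulative AGI from conditioning on one intervention.

    Assumption: p(O | manipulative) = 1, so the likelihood ratio is 1 / baseline.
    """
    return Fraction(1) / baseline


def randomized_update(baselines):
    """Update for each intervention when one is picked uniformly at random.

    Assumption: p(O | manipulative) = 1 / n for each of the n interventions,
    while p(O | natural) is that intervention's baseline probability.
    """
    p_manipulative = Fraction(1, len(baselines))
    return [p_manipulative / b for b in baselines]


def union_update(baselines):
    """Update from conditioning on "one of my interventions happens".

    The interventions are mutually exclusive, so baselines simply add up.
    """
    return Fraction(1) / sum(baselines)


quarter = Fraction(1, 4)
eighth = Fraction(1, 8)

# One intervention at 1/4 baseline: 4x update.
print(single_condition_update(quarter))

# Four interventions at 1/4 each: no update, but they cover every outcome.
print(randomized_update([quarter] * 4), union_update([quarter] * 4))

# Four interventions at 1/8 each: 2x update, same as conditioning on the union.
print(randomized_update([eighth] * 4), union_update([eighth] * 4))
```

In both randomized cases the per-intervention update matches the update from conditioning on the union directly, which is the sense in which rolling the dice outside the simulation buys nothing.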
Nice post!
perhaps this problem can be overcome by including checks for generalization during training, i.e., testing how well the program generalizes to various test distributions.
I don’t think this gets at the core difficulty of speed priors not generalizing well. Let’s say we generate a bunch of lookup-table-ish things according to the speed prior, and then reject all the ones that don’t generalize to our testing set. The majority of the models that pass our check are going to be basically the same as the rest, plus whichever modification that causes them to pass is most probable on the speed prior, i.e. the minimum number of additional entries in the lookup table needed to pass the testing set.
Here’s maybe an argument that the generalization/deception tradeoff is unavoidable: according to the Mingard et al. picture of why neural nets generalize, it’s basically just that they are cheap approximations of rejection sampling with a simplicity prior. This generalizes by relying on the fact that reality is, in fact, simplicity biased—it’s just Ockham’s razor. On this picture of things, neural nets generalize exactly to the extent that they approximate a simplicity prior, and so any attempt to sample according to an alternative prior will lose generalization ability as it gets “further” from the True Distribution of the World.
Ah, so on this view, the endgame doesn’t look like
“make technical progress until the alignment tax is low enough that policy folks or other AI-risk-aware people in key positions will be able to get an unaware world to pay it”
But instead looks more like
“get the world to be aware enough to not bumble into an apocalypse, specifically by promoting rationality, which will let key decision-makers clear out the misaligned memes that keep them from seeing clearly”
Is that a fair summary? If so, I’m pretty skeptical of the proposed AI alignment strategy, even conditional on this strong memetic selection and orthogonality actually happening. It seems like this strategy requires pretty deeply influencing the worldview of many world leaders. That is obviously very difficult: no movement that I’m aware of has done it (at least, quickly), and I think they all would like to if they judged it doable. Admittedly, the reduce-tax strategy requires clarifying and solving a complicated philosophical/technical problem, which is also very difficult. But I think it’s more promising for the following reasons:
It has a stronger precedent (historical examples I’d reference include the invention of computability theory, the invention of information theory and cybernetics, and the adventures in logic leading up to Gödel)
It’s more in line with rationalists’ general skill set, since the group is much more skewed towards analytical thinking and technical problem-solving than towards government/policy folks and being influential among those kinds of people
The number of people we would need to influence will go up as AGI tech becomes easier to develop, and every one is a single point of failure.
To be fair, these strategies are not in a strict either/or, and luckily use largely separate talent pools. But if the proposal here ultimately comes down to moving fungible resources towards the become-aware strategy and away from the technical-alignment strategy, I think I (mid-tentatively) disagree.
The first enigma seems like it’s either very closely related or identical to Hume’s problem of induction. If that is a fair rephrasing, then I think it’s not entirely true that the key problem is that the use of empiricism cannot be justified by empiricism or refuted by empiricism. Principles like “don’t believe in kludgy unwieldy things” and “empiricism is a good foundation for belief” can in fact be supported by empiricism, because those heuristics have worked well in the past, and helped us build houses and whatnot.
I think the key problem is that empiricism both supports and refutes the claim “I know empiricism works because empirically it’s always worked well in the past”. This statement is empirically supported because empiricism has worked well in the past, but it’s also circular, and circular reasoning has not generally worked well in the past.
This can also be re-phrased as a conflict between object-level and meta-reasoning. On the object level, empiricism supports empiricism. But on the meta level, empiricism rejects circular reasoning.
My best guess at mechanism:
Before, I was a person who prided myself on succeeding at marshmallow tests. This caused me to frame work as a thing I want to succeed on, and work too hard.
Then, I read Meaningful Rest and Replacing Guilt, and realized that oftentimes I was working later to get more done that day, even though it would obviously be detrimental to the next day. This makes the reverse marshmallow test dynamic very intuitively obvious.
Now I am still a person who prides myself on my marshmallow prowess, but hopefully I’ve internalized an externality or something. Staying up late to work doesn’t feel Good and Virtuous, it feels Bad and like I’m knowingly Goodharting myself.
Note that this all still boils down to narrative-stuff. I’m nowhere near the level of zen that it takes to Just Pursue The Goal, with no intermediating narratives or drives based on self-image. I don’t think this patch has particularly moved me towards that either, it’s just helpful for where I’m currently at.
Dang, I wish I had read this before the EA Forum’s creative writing contest closed. It makes a lot of sense that HPMOR could be valuable via this “first-person-optimizing-experience” mechanism—I had read it after reading the Sequences, so I was mostly looking for examples of rationality techniques and secret hidden Jedi knowledge.
Since HPMOR!Harry isn’t so much EA as transhumanist, I wonder if a first-person EA experience could be made interesting enough to be a useful story? I suppose the Comet King from Unsong is also kind of close to this niche, but not really described in first person or designed to be related to. This might be worth a stab...
Trying to figure out what’s being said here. My best guess is two major points:
Meta doesn’t work. Do the thing, stop trying to figure out systematic ways to do the thing better, they’re a waste of time. The first thing any proper meta-thinking should notice is that nobody doing meta-thinking seems to be doing object level thinking any better.
A lot of nerds want to be recognized as Deep Thinkers. This makes meta-thinking stuff really appealing for them to read, in hopes of becoming a DT. This in turn makes it appealing for them to write, since it’s what other nerds will read, which is how they get recognized as a DT. All this is despite the fact that it’s useless.
Agree that this is definitely a plausible strategy, and that it doesn’t get anywhere near as much attention as it seemingly deserves, for reasons unknown to me. Strong upvote for the post, I want to see some serious discussion on this. Some preliminary thoughts:
How did we get here?
If I had to guess, the lack of discussion on this seems likely due to a founder effect. The people pulling the alarm in the early days of AGI safety concerns were disproportionately on the technical/philosophical side rather than on the policy/outreach/activism side.
In early days, focus on the technical problem makes sense. When you are the only person in the world working on AGI, all the delay in the world won’t help unless the alignment problem gets solved. But we are working at very different margins nowadays.
There’s also an obvious trap which makes motivated reasoning really easy. Often, the first thing that comes to mind when thinking about slowing down AGI development is sabotage, maybe because this feels urgent and drastic? It’s an obviously bad idea, and maybe that leads us to motivated stopping.
Maybe the “technical/policy” dichotomy is keeping us from thinking of obvious ways we could be making the future much safer? The outreach org you propose seems like not really either. Would be interested in brainstorming other major ways to affect the world, but not gonna do that in this comment.
HEY! FTX! OVER HERE!!
You should submit this to the Future Fund’s ideas competition, even though it’s technically closed. I’m really tempted to do it myself just to make sure it gets done, and very well might submit something in this vein once I’ve done a more detailed brainstorm.