james.lucassen
My attempt to break down the key claims here:
The internet is causing rapid memetic evolution towards ideas which stick in people’s minds and encourage them to take certain actions, especially actions that spread the idea. Ex: wokism, Communism, QAnon, etc.
These memes push people who host them (all of us, to be clear) towards behaviors which are not in the best interests of humanity, because Orthogonality Thesis
The lack of will to work on AI risk comes from these memes’ general interference with clarity/agency, plus selective pressure to develop ways to get past “immune” systems which allow clarity/agency
Before you can work effectively on AI stuff, you have to clear out the misaligned memes stuck in your head. This can get you the clarity/agency necessary, and make sure that (if successful) you actually produce AGI aligned with “you”, not some meme
The global scale is too big for individuals—we need memes to coordinate us. This is why we shouldn’t try to just solve x-risk, we should focus on rationality, cultivating our internal meme garden, and favoring memes which will push the world in the direction we want it to go
When you have a self-image as a productive, hardworking person, the usual Marshmallow Test gets kind of reversed. Normally, there’s some unpleasant task you have to do which is beneficial in the long run. But in the Reverse Marshmallow Test, forcing yourself to work too hard makes you feel Good and Virtuous in the short run but leads to burnout in the long run. I think conceptualizing it this way has been helpful for me.
I think a lot of this discussion becomes clearer if we taboo “intelligence” as something like “ability to search and select a high-ranked option from a large pool of strategies”.
Agree that the rate-limiting step for a superhuman intelligence trying to affect the world will probably be stuff that does not scale very well with intelligence, like large-scale transport, construction, smelting widgets, etc. However, I’m not sure it would be so severe a limitation as to produce situations like what you describe, where a superhuman intelligence sits around for a month waiting for more niobium. The more strategies you are able to search over, the more likely it is that you’ll hit on a faster way of getting niobium.
Agree that being able to maneuver in human society and simulate/manipulate humans socially would probably be much more difficult for a non-human intelligence than some other tasks humans might think of as equally difficult, since humans have a bunch of special-purpose mechanisms for that kind of thing. That being said, I’m not convinced it is so hard as to be practically impossible for any non-human to do. The amount of search power it took evolution to find those abilities isn’t so staggering that it could never be matched.
I’m pretty surprised by the position that “intelligence is [not] incredibly useful for, well, anything”. This seems much more extreme than the position that “intelligence won’t solve literally everything”, and like it requires an alternative explanation of the success of homo sapiens.
Thank you for posting this! There’s a lot of stuff I’m not mentioning because confirming agreements all the time makes for a lot of comment clutter, but there’s plenty of stuff to chew on here. In particular, the historical rate of scientific progress seems like a real puzzle that requires some explanation.
Putting this in a separate comment, because Reign of Terror moderation scares me and I want to compartmentalize. I am still unclear about the following things:
Why do we think memetic evolution will produce complex/powerful results? It seems like the mutation rate is much, much higher than biological evolution.
Valentine describes these memes as superintelligences, as “noticing” things, and generally being agents. Are these superintelligences hosted per-instance-of-meme, with many stuffed into each human? Or is something like “QAnon” kind of a distributed intelligence, doing its “thinking” through social interactions? Both of these models seem to have some problems (power/speed), so maybe something else?
Misaligned (digital) AGI doesn’t seem like it’ll be a manifestation of some existing meme and therefore misaligned, it seems more like it’ll just be some new misaligned agent. There is no highly viral meme going around right now about producing tons of paperclips.
This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?
“Just Retarget the Search” directly eliminates the inner alignment problem.
I think deception is still an issue here. A deceptive agent will try to obfuscate its goals, so unless you’re willing to assume that our interpretability tools are so good they can’t ever be tricked, you have to deal with that.
It’s not necessarily a huge issue—hopefully with interpretability tools this good we can spot deception before it gets competent enough to evade them, but it’s not just “bada-bing bada-boom” exactly.
Ok this is going to be messy but let me try to convey my hunch for why randomization doesn’t seem very useful.
- Say I have an intervention that’s helpful, and has a baseline 1⁄4 probability. If I condition on this statement, I get 1 “unit of helpfulness”, and a 4x update towards manipulative AGI.
- Now let’s say I have four interventions like the one above, and I pick one at random. p(O | manipulative) = 1⁄4, which is the same as baseline, so I get one unit of helpfulness and no update towards manipulative AGI!
- BUT, the four interventions have to be mutually exclusive. Which means that if I’d done no simulation at all, I would’ve gotten my one unit of helpfulness anyway, since the four interventions cover all possible outcomes.
- Ok, well, what if my four interventions have 1⁄8 baseline probability each, so only 50% total. Then I pick one at random, p(O | natural) = 1⁄8, p(O | manipulative) = 1⁄4, so I get a 2x update towards manipulative AGI. This is the same as if I’d just conditioned on the statement “one of my four interventions happens”, and let the randomization happen inside the simulation instead of outside. The total probability of that is 50%, so I get my one unit of helpfulness, at the cost of a 2x update.
Maybe the core thing here is a consequence of framing our conditions as giving us bits of search to get lottery outcomes that we like. Rolling the dice to determine what to condition on isn’t doing anything different from just using a weaker search condition—it gives up bits of search, and so it has to pay less.
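A quick numeric sketch of the bullets above (one assumption of mine, not from the original discussion: a manipulative AGI steers the outcome onto exactly one intervention, which is where the 4x in the first bullet comes from):

```python
def bayes_factor(p_o_natural, p_o_manipulative):
    """Update towards the "manipulative AGI" hypothesis after observing outcome O."""
    return p_o_manipulative / p_o_natural

# Condition directly on one helpful intervention with baseline probability 1/4:
print(bayes_factor(1/4, 1))    # 4.0x update, one unit of helpfulness

# Four mutually exclusive 1/4-probability interventions, pick one at random.
# The manipulator only matches our random pick 1/4 of the time:
print(bayes_factor(1/4, 1/4))  # 1.0x (no update), but the four options already
                               # covered every outcome anyway

# Four 1/8-probability interventions (1/2 total), pick one at random:
print(bayes_factor(1/8, 1/4))  # 2.0x, same as conditioning on the disjunction:
                               # 1 / (1/2) = 2
```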
Nice post!
perhaps this problem can be overcome by including checks for generalization during training, i.e., testing how well the program generalizes to various test distributions.
I don’t think this gets at the core difficulty of speed priors not generalizing well. Let’s say we generate a bunch of lookup-table-ish things according to the speed prior, and then reject all the ones that don’t generalize to our testing set. The majority of the models that pass our check are going to be basically the same as the rest, plus whichever modification for passing the check is most probable on the speed prior, i.e. the minimum number of additional entries in the lookup table.
Here’s maybe an argument that the generalization/deception tradeoff is unavoidable: according to the Mingard et al. picture of why neural nets generalize, it’s basically just that they are cheap approximations of rejection sampling with a simplicity prior. This generalizes by relying on the fact that reality is, in fact, simplicity biased—it’s just Ockham’s razor. On this picture of things, neural nets generalize exactly to the extent that they approximate a simplicity prior, and so any attempt to sample according to an alternative prior will lose generalization ability as it gets “further” from the True Distribution of the World.
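To make that picture concrete, here’s a toy sketch (my own illustration, not anything from Mingard et al.; “number of input bits the function depends on” is a crude stand-in for a simplicity measure): rejection-sample boolean functions that fit a handful of training points, once from a uniform prior and once from a simplicity-biased one, and compare accuracy on the held-out inputs.

```python
import random
from itertools import product

# Toy illustration of "rejection sampling with a simplicity prior" (my own
# sketch, not Mingard et al.'s construction). Hypotheses are all 256 boolean
# functions on 3 bits; "complexity" = number of input bits the output depends
# on. The true function is simple: it just returns the first bit.

INPUTS = list(product([0, 1], repeat=3))
target = {x: x[0] for x in INPUTS}
TRAIN = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1)]  # 5 labelled points
TEST = [x for x in INPUTS if x not in TRAIN]                     # 3 held-out points

ALL_TABLES = [dict(zip(INPUTS, bits)) for bits in product([0, 1], repeat=8)]

def flip(x, i):
    return x[:i] + (1 - x[i],) + x[i + 1:]

def complexity(table):
    """Number of input bits the function actually depends on."""
    return sum(any(table[x] != table[flip(x, i)] for x in INPUTS) for i in range(3))

def rejection_sample(prior_weight, rng, n=1000):
    """Draw tables from the prior, keep only those that fit every training point."""
    weights = [prior_weight(t) for t in ALL_TABLES]
    kept = []
    while len(kept) < n:
        t = rng.choices(ALL_TABLES, weights=weights, k=1)[0]
        if all(t[x] == target[x] for x in TRAIN):
            kept.append(t)
    return kept

def heldout_accuracy(tables):
    return sum(t[x] == target[x] for t in tables for x in TEST) / (len(tables) * len(TEST))

rng = random.Random(0)
print("uniform prior:   ", heldout_accuracy(rejection_sample(lambda t: 1.0, rng)))
print("simplicity prior:", heldout_accuracy(rejection_sample(lambda t: 2.0 ** -complexity(t), rng)))
```

On this toy setup the simplicity-biased sampler should do noticeably better on the held-out points, which is the Ockham’s razor story in miniature; swap the target for something genuinely complex and that advantage should disappear, matching the claim that you only generalize to the extent your prior matches reality.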
Ah, so on this view, the endgame doesn’t look like
“make technical progress until the alignment tax is low enough that policy folks or other AI-risk-aware people in key positions will be able to get an unaware world to pay it”
But instead looks more like
“get the world to be aware enough to not bumble into an apocalypse, specifically by promoting rationality, which will let key decision-makers clear out the misaligned memes that keep them from seeing clearly”
Is that a fair summary? If so, I’m pretty skeptical of the proposed AI alignment strategy, even conditional on this strong memetic selection and orthogonality actually happening. It seems like this strategy requires pretty deeply influencing the worldview of many world leaders. That is obviously very difficult because no movement that I’m aware of has done it (at least, quickly), and I think they all would like to if they judged it doable. Admittedly, the reduce-tax strategy requires clarifying and solving a complicated philosophical/technical problem, which is also very difficult. I think it’s more promising for the following reasons:
It has a stronger precedent (historical examples I’d reference include the invention of computability theory, the invention of information theory and cybernetics, and the adventures in logic leading up to Gödel)
It’s more in line with rationalists’ general skill set, since the group is much more skewed towards analytical thinking and technical problem-solving than towards government/policy work and being influential among those kinds of people
The number of people we would need to influence will go up as AGI tech becomes easier to develop, and every one is a single point of failure.
To be fair, these strategies are not in a strict either/or, and luckily use largely separate talent pools. But if the proposal here ultimately comes down to moving fungible resources towards the become-aware strategy and away from the technical-alignment strategy, I think I (mid-tentatively) disagree.
The first enigma seems like it’s either very closely related or identical to Hume’s problem of induction. If that is a fair rephrasing, then I think it’s not entirely true that the key problem is that the use of empiricism cannot be justified or refuted by empiricism. Principles like “don’t believe in kludgy unwieldy things” and “empiricism is a good foundation for belief” can in fact be supported by empiricism—because those heuristics have worked well in the past, and helped us build houses and whatnot.
I think the key problem is that empiricism both supports and refutes the claim “I know empiricism works because empirically it’s always worked well in the past”. This statement is empirically supported because empiricism has worked well in the past, but it’s also circular, and circular reasoning has not generally worked well in the past.
This can also be re-phrased as a conflict between object-level and meta-reasoning. On the object level, empiricism supports empiricism. But on the meta level, empiricism rejects circular reasoning.
My best guess at mechanism:
Before, I was a person who prided myself on succeeding at marshmallow tests. This caused me to frame work as a thing I want to succeed at, and to work too hard.
Then, I read Meaningful Rest and Replacing Guilt, and realized that oftentimes I was working later to get more done that day, even though it would obviously be detrimental to the next day. This made the Reverse Marshmallow Test dynamic very intuitively obvious.
Now I am still a person who prides myself on my marshmallow prowess, but hopefully I’ve internalized an externality or something. Staying up late to work doesn’t feel Good and Virtuous, it feels Bad and like I’m knowingly Goodharting myself.
Note that this all still boils down to narrative-stuff. I’m nowhere near the level of zen that it takes to Just Pursue The Goal, with no intermediating narratives or drives based on self-image. I don’t think this patch has particularly moved me towards that either, it’s just helpful for where I’m currently at.
Dang, I wish I had read this before the EA Forum’s creative writing contest closed. It makes a lot of sense that HPMOR could be valuable via this “first-person-optimizing-experience” mechanism—I had read it after reading the Sequences, so I was mostly looking for examples of rationality techniques and secret hidden Jedi knowledge.
Since HPMOR!Harry isn’t so much EA as transhumanist, I wonder if a first-person EA experience could be made interesting enough to be a useful story? I suppose the Comet King from Unsong is also kind of close to this niche, but not really described in first person or designed to be related to. This might be worth a stab...
Trying to figure out what’s being said here. My best guess is two major points:
Meta doesn’t work. Do the thing, stop trying to figure out systematic ways to do the thing better, they’re a waste of time. The first thing any proper meta-thinking should notice is that nobody doing meta-thinking seems to be doing object level thinking any better.
A lot of nerds want to be recognized as Deep Thinkers. This makes meta-thinking stuff really appealing for them to read, in hopes of becoming a DT. This in turn makes it appealing for them to write, since it’s what other nerds will read, which is how they get recognized as a DT. All this is despite the fact that it’s useless.
Memetic evolution dominates biological evolution for the same reason.
Faster mutation rate doesn’t just produce faster evolution—it also reduces the steady-state fitness. Complex machinery can’t reliably be evolved if pieces of it are breaking all the time. I’m mostly relying on No Evolutions for Corporations or Nanodevices plus one undergrad course in evolutionary bio here.
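For intuition on that steady-state point, a toy simulation (my own sketch with made-up numbers, not from that post): bit-string genomes evolve towards an all-ones target under truncation selection, at a low and a high per-bit mutation rate.

```python
import random

# Toy sketch (my own, with made-up numbers; not from the linked post):
# bit-string genomes evolve towards an all-ones target under truncation
# selection, with each bit flipping independently at some mutation rate.

GENOME_LEN = 50
POP_SIZE = 200
GENERATIONS = 300

def fitness(genome):
    return sum(genome)  # number of "correct" (one) bits, max 50

def steady_state_fitness(mutation_rate, rng):
    pop = [[0] * GENOME_LEN for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # Truncation selection: keep the fitter half, refill with copies.
        pop.sort(key=fitness, reverse=True)
        pop = [list(g) for g in pop[: POP_SIZE // 2] for _ in (0, 1)]
        # Mutation: flip each bit independently with probability mutation_rate.
        for g in pop:
            for i in range(GENOME_LEN):
                if rng.random() < mutation_rate:
                    g[i] ^= 1
    return sum(fitness(g) for g in pop) / POP_SIZE

rng = random.Random(0)
print("low mutation rate  (0.005):", steady_state_fitness(0.005, rng))  # ends near the max of 50
print("high mutation rate (0.15): ", steady_state_fitness(0.15, rng))   # plateaus well below 50
```

The high-mutation population climbs quickly but then plateaus well short of the target, because selection can’t repair pieces as fast as mutation breaks them.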
Also, just empirically: memetic evolution produced civilization, social movements, Crusades, the Nazis, etc.
Thank you for pointing this out. I agree with the empirical observation that we’ve had some very virulent and impactful memes. I’m skeptical about saying that those were produced by evolution rather than something more like genetic drift, because of the mutation-rate argument. But given that observation, I don’t know if it matters if there’s evolution going on or not. What we’re concerned with is the impact, not the mechanism.
I think at this point I’m mostly just objecting to the aesthetic and some less-rigorous claims that aren’t really important, not the core of what you’re arguing. Does it just come down to something like:
“Ideas can be highly infectious and strongly affect behavior. Before you do anything, check for ideas in your head which affect your behavior in ways you don’t like. And before you try to tackle a global-scale problem with a small-scale effort, see if you can get an idea out into the world to get help.”
Another project I am excited about but probably am not going to get to for a while:
Investigating the possibility of a “steganographic snowball”. The basic case for hope in CoT authenticity is something like “even if the AI wants to deceive us, if we start off with strong CoT authenticity, then unless it’s smart enough to (within a single forward pass) invent a steganographic protocol too good for us to detect, we can pin it down and it can’t ever secure a hidden channel to think misaligned thoughts in”. If it weren’t for this argument, I would be much less excited about trying to prevent an AI from thinking certain thoughts using CoT oversight, since you would have to directly fight the intelligence of the AI to do so. But if it can’t deploy its full intelligence to reasoning about how to escape, then we’re better off.
The concerns for this angle are as follows:
Can you really get down to 0 hidden channel bandwidth?
Can very small hidden channels support a parallel CoT?
In very small hidden channels, does reasoning degrade, or just slow down?
Can the model use a parallel CoT to reason about how to secure more hidden channel bandwidth?
I have some basic ideas for ways to measure these properties, and if all four show up positive, I would consider that significant evidence that a “steganographic snowball” is going to happen. That would mean that if your AI has any tendency to hide its thoughts from you, you won’t be able to stop it forever. More specifically, you’ll have to fight off the full intelligence of your AI if you want to prevent it from securing the amount of hidden channel bandwidth it wants.
There’s room for retreat to a “snowball would take a long time and we can extract work in the meantime” position, but that seems like a serious concession.
Thanks to Paul Colognese for brainstorming this stuff with me.
I don’t think I understand how the scorecard works. From:
[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.
And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?
If the scorecard is learned, then it needs a training signal from Steering. But if the scorecard is useless at the start, Steering can’t use it to generate that signal. On the other hand, since the “ontology” of the Learning subsystem is learned-from-scratch, it seems difficult for a hard-coded scorecard to do this translation task.
TLDR: if we model a human as a collection of sub-agents rather than a single agent, how do we make normative claims about which sub-agents should or shouldn’t hammer down others? There’s no over-arching set of goals to evaluate against, and each sub-agent always wants to hammer down all the others.
If I’m interpreting things right, I think I agree with the descriptive claims here, but tentatively disagree with the normative ones. I agree that modeling humans as single agents is inaccurate, and a multi-agent model of some sort is better. I also agree that the Drowning Child parable emphasizes the conflict between two sub-agents, although I’m not sure it sets up one side against the other too strongly (I know some people for whom the Drowning Child conflict hammers down altruism).
What I have trouble with is thinking about how a multi-agent human “should” try to alter the weights of their sub-agents, or influence this “hammering” process. We can’t really ask the sub-agents for their opinion, since they’re always all in conflict with all the others, to varying degrees. If some event (like exposure to a thought experiment) forces a conflict between sub-agents to rise to confrontation, and one side or the other ends up winning out, that doesn’t have any intuitive normative consequences to me. In fact, it’s not clear to me how it could have normativity to it at all, since there’s no over-arching set of goals for it to be evaluated against.
I think so. But I’d want to sit down and prove something more rigorously before abandoning the strategy, because there may be times we can get value for free in situations more complicated than this toy example.
In general, I’m a bit unsure about how much of an interpretability advantage we get from slicing the model up into chunks. If the pieces are trained separately, then we can reason about each part individually based on its training procedure. In the optimistic scenario, this means that the computation happening in the part of the system labeled “world model” is actually something humans would call world modelling. This is definitely helpful for interpretability. But the alternative possibility is that we get one or more mesa-optimizers, which seems less interpretable.
Agree that this is definitely a plausible strategy, and that it doesn’t get anywhere near as much attention as it seemingly deserves, for reasons unknown to me. Strong upvote for the post, I want to see some serious discussion on this. Some preliminary thoughts:
How did we get here?
If I had to guess, the lack of discussion on this seems likely due to a founder effect. The people raising the alarm in the early days of AGI safety concerns were disproportionately on the technical/philosophical side rather than the policy/outreach/activism side.
In the early days, focus on the technical problem makes sense. When you are the only person in the world working on AGI, all the delay in the world won’t help unless the alignment problem gets solved. But we are working at very different margins nowadays.
There’s also an obvious trap which makes motivated reasoning really easy. Often, the first thing that comes to mind when thinking about slowing down AGI development is sabotage—maybe because it feels urgent and drastic? It’s an obviously bad idea, and maybe that leads us to motivated stopping.
Maybe the “technical/policy” dichotomy is keeping us from thinking of obvious ways we could be making the future much safer? The outreach org you propose doesn’t really seem like either one. Would be interested in brainstorming other major ways to affect the world, but not gonna do that in this comment.
HEY! FTX! OVER HERE!!
You should submit this to the Future Fund’s ideas competition, even though it’s technically closed. I’m really tempted to do it myself just to make sure it gets done, and very well might submit something in this vein once I’ve done a more detailed brainstorm.