It doesn’t work if your goal is to find the optimal answer, but we hardly ever want the optimal answer; we just want a good-enough one.
Also not an expert, but I think this is correct
When a bounded agent attempts a task, we observe some degree of success. But the degree of success depends on many factors that are not “part of” the agent, that is, factors outside the Cartesian boundary that we (the observers) choose to draw for modeling purposes. These factors include things like power, luck, task difficulty, assistance, etc. If we are concerned with the agent as a learner and don’t consider knowledge as part of the agent, factors like knowledge, skills, beliefs, etc. are also externalized. Applied rationality is the result of attempting to distill this big complicated mapping from (agent, power, luck, task, knowledge, skills, beliefs, etc.) → success down to just agent → success. This lets us assign each agent a one-dimensional score: “how well do you achieve goals overall?” Note that, for no-free-lunch reasons (no agent does well across all possible tasks), this already-fuzzy thing is further fuzzified by the need to weight tasks according to what the observer cares about.
Applied rationality is a property of a bounded agent, which attempts to describe how successful that agent tends to be when you throw tasks at it, while controlling for both “environmental” factors such as luck and “epistemic” factors such as beliefs.
In this framing, it’s pretty easy to define epistemic rationality analogously: compress the mapping from everything → prediction loss down to just agent → prediction loss.
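One hedged way to write the two compressions down. The notation is purely illustrative and not from any standard source: $S(a, t, e)$ is the success observed when agent $a$ attempts task $t$ under externalized factors $e$ (power, luck, assistance, knowledge, and so on), $L(a, t, e)$ is the corresponding prediction loss, and $D$ is whatever distribution over tasks and factors the observer cares about.

```latex
\[
  R_{\text{applied}}(a) \;=\; \mathbb{E}_{(t,\,e) \sim D}\big[\, S(a, t, e) \,\big],
  \qquad
  R_{\text{epistemic}}(a) \;=\; \mathbb{E}_{(t,\,e) \sim D}\big[\, L(a, t, e) \,\big]
\]
```

Averaging over $(t, e)$ is what “controlling for” the external factors amounts to in this sketch, and the choice of $D$ is where the observer-dependence enters.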
However, in retrospect I think the definition I gave here is nearly identical to how I would have defined “intelligence”, just without reference to the “mapping broad start distribution to narrow outcome distribution” idea (optimization power) that I usually associate with that term. If anyone could clarify specifically the difference between applied rationality and intelligence, I would be interested.
Maybe you also have to control for “computational factors” like raw processing power, or something? But then what’s left inside the Cartesian boundary? Just the algorithm? That seems like it has potential, but still feels messy.
This leans a bit close to the pedantry side, but the title is also a bit strange when taken literally. Three useful types (of akrasia categories)? Types of akrasia, right, not types of categories?
That said, I do really like this classification! Introspectively, it seems like the three could have quite distinct causes, so understanding which category you struggle with could be important for efforts to fix it.
Props for first post!
Trying to figure out what’s being said here. My best guess is two major points:
Meta doesn’t work. Do the thing, stop trying to figure out systematic ways to do the thing better, they’re a waste of time. The first thing any proper meta-thinking should notice is that nobody doing meta-thinking seems to be doing object level thinking any better.
A lot of nerds want to be recognized as Deep Thinkers. This makes meta-thinking stuff really appealing for them to read, in hopes of becoming a DT. This in turn makes it appealing for them to write, since it’s what other nerds will read, which is how they get recognized as a DT. All this is despite the fact that it’s useless.
Ah, gotcha. I think the post is fine, I just failed to read.
If I now correctly understand, the proposal is to ask an LLM to simulate human approval, and use that as the training signal for your Big Scary AGI. I think this still has some problems:
Using an LLM to simulate human approval sounds like reward modeling, which seems useful. But LLMs aren’t trained to simulate humans, they’re trained to predict text. So, for example, an LLM will regurgitate the dominant theory of human values, even if it has learned (in a Latent Knowledge sense) that humans really value something else.
Even if the simulation is perfect, using human approval isn’t a solution to outer alignment, for reasons like deception and wireheading.
I worry that I still might not understand your question, because I don’t see how fragility of value and orthogonality come into this?
The key thing here seems to be the difference between understanding a value and having that value. Nothing about the fragile value claim or the Orthogonality thesis says that the main blocker is AI systems failing to understand human values. A superintelligent paperclip maximizer could know what I value and just not do it, the same way I can understand what the paperclipper values and choose to pursue my own values instead.
Your argument is for LLMs understanding human values, but that doesn’t necessarily have anything to do with the values that they actually have. It seems likely that their actual values are something like “predict text accurately”, and this requires understanding human values but not adopting them.
now this is how you win the first-ever “most meetings” prize
Agree that this is definitely a plausible strategy, and that it doesn’t get anywhere near as much attention as it seemingly deserves, for reasons unknown to me. Strong upvote for the post, I want to see some serious discussion on this. Some preliminary thoughts:
How did we get here?
If I had to guess, the lack of discussion on this is likely due to a founder effect. The people pulling the alarm in the early days of AGI safety concerns were disproportionately on the technical/philosophical side rather than the policy/outreach/activism side.
In early days, focus on the technical problem makes sense. When you are the only person in the world working on AGI, all the delay in the world won’t help unless the alignment problem gets solved. But we are working at very different margins nowadays.
There’s also an obvious trap which makes motivated reasoning really easy. Often, the first thing that comes to mind when thinking about slowing down AGI development is sabotage, maybe because this feels urgent and drastic? It’s an obviously bad idea, and maybe that leads us to motivated stopping.
Maybe the “technical/policy” dichotomy is keeping us from thinking of obvious ways we could be making the future much safer? The outreach org you propose seems like not really either. Would be interested in brainstorming other major ways to affect the world, but not gonna do that in this comment.
HEY! FTX! OVER HERE!!
You should submit this to the Future Fund’s ideas competition, even though it’s technically closed. I’m really tempted to do it myself just to make sure it gets done, and very well might submit something in this vein once I’ve done a more detailed brainstorm.
I don’t think I understand how the scorecard works. From:
[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.
And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?
If the scorecard is learned, then it needs a training signal from Steering. But if it’s useless at the start, it can’t provide a training signal. On the other hand, since the “ontology” of the Learning subsystem is learned-from-scratch, then it seems difficult for a hard-coded scorecard to do this translation task.
this is great, thanks!
What do you think about the effectiveness of the particular method of digital decluttering recommended by Digital Minimalism? What modifications would you recommend? Ideal duration?
One reason I have yet to do a month-long declutter is because I remember thinking something like “this process sounds like something Cal Newport just kinda made up and didn’t particularly test, my own methods that I think of for me will probably be better than Cal’s method he thought of for him”.
So far my own methods have not worked.
Memetic evolution dominates biological evolution for the same reason.
A faster mutation rate doesn’t just produce faster evolution; it also reduces the steady-state fitness. Complex machinery can’t reliably be evolved if pieces of it are breaking all the time. I’m mostly relying on No Evolutions for Corporations or Nanodevices plus one undergrad course in evolutionary bio here.
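A toy illustration of the steady-state claim, as a minimal sketch under stated assumptions: a one-locus haploid model where a “working” variant has fitness 1, a “broken” variant has fitness `1 - s`, and each generation a fraction `u` of working copies break. The function name and parameters are hypothetical, and this says nothing about whether memes actually live in this regime.

```python
def steady_state_mean_fitness(u, s, generations=5000):
    """Iterate a one-locus haploid selection-mutation model to (near) equilibrium.

    u: per-generation probability that a working copy breaks (mutation rate)
    s: fitness cost of a broken copy (working = 1, broken = 1 - s)
    Returns the population's mean fitness after the given number of generations.
    """
    p = 1.0  # frequency of the working variant; start with everyone working
    for _ in range(generations):
        mean_fitness = p + (1.0 - p) * (1.0 - s)
        p = p / mean_fitness  # selection: working copies out-reproduce broken ones
        p = p * (1.0 - u)     # mutation: some working copies break
    return p + (1.0 - p) * (1.0 - s)


low = steady_state_mean_fitness(u=0.01, s=0.1)
high = steady_state_mean_fitness(u=0.05, s=0.1)
print(low, high)
```

In this model the equilibrium mean fitness is about `1 - u` as long as `u < s`, so raising the mutation rate lowers the fitness the population settles at. Once `u` reaches `s`, the working variant is lost entirely.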
Also, just empirically: memetic evolution produced civilization, social movements, Crusades, the Nazis, etc.
Thank you for pointing this out. I agree with the empirical observation that we’ve had some very virulent and impactful memes. I’m skeptical about saying that those were produced by evolution rather than something more like genetic drift, because of the mutation-rate argument. But given that observation, I don’t know if it matters if there’s evolution going on or not. What we’re concerned with is the impact, not the mechanism.
I think at this point I’m mostly just objecting to the aesthetic and some less-rigorous claims that aren’t really important, not the core of what you’re arguing. Does it just come down to something like:
“Ideas can be highly infectious and strongly affect behavior. Before you do anything, check for ideas in your head which affect your behavior in ways you don’t like. And before you try and tackle a global-scale problem with a small-scale effort, see if you can get an idea out into the world to get help.”
I think we’re seeing Friendly memetic tech evolving that can change how influence comes about.
Wait, literally evolving? How? Coincidence despite orthogonality? Did someone successfully set up an environment that selects for Friendly memes? Or is this not literally evolving, but more like “being developed”?
The key tipping point isn’t “World leaders are influenced” but is instead “The Friendly memetic tech hatches a different way of being that can spread quickly.” And the plausible candidates I’ve seen often suggest it’ll spread superexponentially.
Whoa! I would love to hear more about these plausible candidates.
There’s insufficient collective will to do enough of the right kind of alignment research.
I parse this second point as something like “alignment is hard enough that you need way more quality-adjusted research-years (QARYs?) than the current track is capable of producing. This means that to have any reasonable shot at success, you basically have to launch a much larger (but still aligned) movement via memetic tech, or just pray you’re the messiah and can singlehandedly provide all the research value of that mass movement.” That seems plausible, and concerning, but highly sensitive to the difficulty of the alignment problem, which I personally have practically zero idea how to forecast.
Ah, so on this view, the endgame doesn’t look like
“make technical progress until the alignment tax is low enough that policy folks or other AI-risk-aware people in key positions will be able to get an unaware world to pay it”
But instead looks more like
“get the world to be aware enough to not bumble into an apocalypse, specifically by promoting rationality, which will let key decision-makers clear out the misaligned memes that keep them from seeing clearly”
Is that a fair summary? If so, I’m pretty skeptical of the proposed AI alignment strategy, even conditional on this strong memetic selection and orthogonality actually happening. It seems like this strategy requires pretty deeply influencing the worldview of many world leaders. That seems very difficult: no movement that I’m aware of has done it (at least, quickly), and I think they would all like to if they judged it doable. Admittedly, the reduce-tax strategy requires clarifying and solving a complicated philosophical/technical problem, which is also very difficult. But I think it’s more promising for the following reasons:
It has a stronger precedent (historical examples I’d reference include the invention of computability theory, the invention of information theory and cybernetics, and the adventures in logic leading up to Gödel)
It’s more in line with rationalists’ general skill set, since the group is much more skewed towards analytical thinking and technical problem-solving than towards government/policy folks and being influential among those kinds of people
The number of people we would need to influence will go up as AGI tech becomes easier to develop, and every one is a single point of failure.
To be fair, these strategies are not a strict either/or, and luckily use largely separate talent pools. But if the proposal here ultimately comes down to moving fungible resources towards the become-aware strategy and away from the technical-alignment strategy, I think I (mid-tentatively) disagree.
Putting this in a separate comment, because Reign of Terror moderation scares me and I want to compartmentalize. I am still unclear about the following things:
Why do we think memetic evolution will produce complex/powerful results? It seems like the mutation rate is much, much higher than in biological evolution.
Valentine describes these memes as superintelligences, as “noticing” things, and generally being agents. Are these superintelligences hosted per-instance-of-meme, with many stuffed into each human? Or is something like “QAnon” kind of a distributed intelligence, doing its “thinking” through social interactions? Both of these models seem to have some problems (power/speed), so maybe something else?
Misaligned (digital) AGI doesn’t seem like it’ll be a manifestation of some existing meme and therefore misaligned, it seems more like it’ll just be some new misaligned agent. There is no highly viral meme going around right now about producing tons of paperclips.
My attempt to break down the key claims here:
The internet is causing rapid memetic evolution towards ideas which stick in people’s minds and encourage them to take certain actions, especially ones that spread the idea. Ex: wokism, Communism, QAnon, etc.
These memes push people who host them (all of us, to be clear) towards behaviors which are not in the best interests of humanity, because Orthogonality Thesis
The lack of will to work on AI risk comes from these memes’ general interference with clarity/agency, plus selective pressure to develop ways to get past “immune” systems which allow clarity/agency
Before you can work effectively on AI stuff, you have to clear out the misaligned memes stuck in your head. This can get you the clarity/agency necessary, and make sure that (if successful) you actually produce AGI aligned with “you”, not some meme
The global scale is too big for individuals—we need memes to coordinate us. This is why we shouldn’t try and just solve x-risk, we should focus on rationality, cultivating our internal meme garden, and favoring memes which will push the world in the direction we want it to go
So, what it sounds like to me is that you at least somewhat buy a couple object-level moral arguments for veganism, but also put a high confidence in some variety of moral anti-realism which undermines those arguments. There are two tracks of reasoning I would consider here.
First: if anti-realism is correct, it doesn’t matter what we do. If anti-realism is not correct, then it seems like we shouldn’t eat animals. Unless we’re 100% confident in the anti-realism, it seems like we shouldn’t eat animals. Note that there are a couple difficulties with this kind of view—some sticking points with stating it precisely, and the pragmatic difficulty of letting a tiny sliver of credence drive your actions.
Second: even if morals aren’t real, values still are real. Just as a purely descriptive matter, you as a homo sapiens probably have some values, even if there isn’t some privileged set of values that’s “correct”. Anti-realism claims tend to sneak in a connotation roughly of the form “if morals aren’t real, then I should just do whatever I want”, where “whatever I want” looks sort of like a cartoon Ayn Rand on a drunken power trip. But the whole thing about anti-realism is that there are no norms about what you should/shouldn’t do. If you want to, you could still be a saint-as-traditionally-defined. So which world do you prefer, not just based on what’s “morally correct”, but based on your own values: the world with meat at the cost of animal suffering, or the world without? Recommended reading on this topic from E-Yudz: What Would You Do Without Morality?
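The first track above can be sketched as a small expected-value comparison. The symbols are illustrative assumptions: $p$ is your credence in anti-realism, and $\Delta < 0$ is the moral value of eating animals relative to abstaining if realism holds. If anti-realism holds, the difference is taken to be zero (“it doesn’t matter what we do”).

```latex
\[
  \mathbb{E}\big[ V(\text{eat}) - V(\text{abstain}) \big]
  \;=\; p \cdot 0 \;+\; (1 - p)\,\Delta
  \;=\; (1 - p)\,\Delta
  \;<\; 0
  \quad \text{whenever } p < 1 .
\]
```

This treats value as comparable across the two hypotheses, which is one of the sticking points with stating the view precisely. It also makes the pragmatic worry visible: the sign does not depend on how small $1 - p$ is.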
This might not work depending on the details of how “information” is specified in these examples, but would this model of abstractions consider “blob of random noise” a good abstraction?
On the one hand, different blobs of random noise contain no information about each other on a particle level—in fact, they contain no information about anything on a particle level, if the noise is “truly” random. And yet they seem like a natural category, since they have “higher-level properties” in common, such as unpredictability and idk maybe mean/sd of particle velocities or something.
This is basically my attempt to produce an illustrative example for my worry that mutual information might not be sufficient to capture the relationships between abstractions that make them good abstractions, such as “usefulness” or other higher-level properties.
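The worry can be made concrete with a minimal sketch, assuming a “blob” is just a list of independent Gaussian samples (a stand-in for particle velocities). Sample correlation is used here only as a crude proxy for particle-level information; the two blobs are independent by construction, so their true mutual information is zero. All names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def summary(xs):
    """Higher-level properties of a blob: mean and standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return mean, sd


rng = random.Random(0)  # fixed seed so the sketch is reproducible
blob_a = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
blob_b = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

particle_level_link = pearson(blob_a, blob_b)  # close to zero
mean_a, sd_a = summary(blob_a)
mean_b, sd_b = summary(blob_b)                 # close to blob_a's summary
print(particle_level_link, (mean_a, sd_a), (mean_b, sd_b))
```

The blobs tell you essentially nothing about each other sample by sample, yet their summaries nearly coincide. The summaries match because both blobs were drawn from the same distribution, which is arguably the shared “higher-level property” that a pairwise mutual-information criterion would need some way to see.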
unlike other technologies, an AI disaster might not wait around for you to come clean it up
I think this piece is extremely important, and I would have put it in a more central place. The whole “instrumental goal preservation” argument makes AI risk very different from the knife/electricity/car analogies. It means that you only get one shot, and can’t rely on iterative engineering. Without that piece, the argument is effectively (but not exactly) considering only low-stakes alignment.
In fact, I think if we get rid of this piece of the alignment problem, basically all of the difficulty goes away. If you can always try again after something goes wrong, then if a solution exists you will always find it eventually. This piece seems like much of what makes the difference between “AI could potentially cause harm” and “AI could potentially be the most important problem in the world”. And I think even the most bullish techno-optimist probably won’t deny the former claim if you press them on it.
Might follow this up with a post?
Another minor note: very last link, to splendidtable, seems to include an extra comma at the end of the link which makes it 404