Interpretations
First, a reply to interpretations of previous words:
I am not claiming that unconditional diffusion models trained on a face dataset optimize faciness. In fact I’m confused how you could possibly have arrived at that interpretation of my words, because I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility. The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I hope we agree that a discriminator which is trained only to robustly recognize good essays probably does not contain enough information to generate good essays, for the same reason that an image discriminator does not contain enough information to generate good images: the discriminator only learns the boundaries of words/categories over images, not the far more complex distribution of realistic images embedded within them.
Optimizing only for faciness via a discriminator does not work well; that’s the old DeepDream approach. Optimizing only for “good essayness” probably does not work well either. These approaches do not actually get high utility.
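For concreteness, here is a minimal sketch (assuming PyTorch) of that DeepDream-style approach: plain gradient ascent on a discriminator’s output with respect to the input. The tiny ConvNet is an illustrative stand-in for a trained face/no-face discriminator, not from any real codebase.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a discriminator trained on faces.
discriminator = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(1),  # single "faciness" logit
)

x = torch.randn(1, 3, 64, 64, requires_grad=True)  # start from noise
opt = torch.optim.Adam([x], lr=0.05)

for step in range(200):
    opt.zero_grad()
    loss = -discriminator(x).mean()  # gradient *ascent* on faciness
    loss.backward()
    opt.step()

# x now maxes out the logit, but typically looks like adversarial junk:
# the discriminator learned a decision boundary, not the distribution of
# faces, so "maximally facey" inputs need not resemble faces at all.
```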
So when you say “I am specifically arguing that because diffusion models trained on a face dataset don’t optimize for faciness, they aren’t a fair comparison with the task of doing things that get high utility”, that just seems confused to me, because diffusion models do get high utility, and not via optimizing just for faciness (which results in low utility).
When you earlier said:
In other words, if you’re trying to write a really good essay, you don’t care what the highest likelihood essay from the distribution of human essays looks like, you care about what the essay that maxes out your essay-quality function is.
The obvious interpretation there still seems to be optimizing only for the discriminator objective, and I’m surprised you are surprised I interpreted it that way, especially when I replied that the only way to actually get a good essay is to sample from the distribution of essays conditioned on goodness, i.e. the distribution of good essays.
Anyway, here you are making a somewhat different point:
The essay example is claiming that if your goal is to write a really good essay, what matters is not your ability to write lots of typical essays, but your ability to tell what a good essay is robustly.
I still think this is not quite right, in that a diffusion model works by combining the ability to write typical essays with a discriminator to condition on good essays, such that both abilities matter. But I see your point is basically “the discriminator or utility function is the hard part for AGI”, so let me move on to the more cruxy part.
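To illustrate that “combine a generative model with a discriminator” point, here is a minimal sketch (assuming PyTorch) of classifier guidance, i.e. sampling from p(essay | good) rather than maximizing goodness. `score_model` and `goodness_classifier` are hypothetical stand-ins for trained networks.

```python
import torch

def guided_score(x_t, t, score_model, goodness_classifier, scale=1.0):
    """Guided score: grad log p(x|good) = grad log p(x) + grad log p(good|x)."""
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        # Assumes the classifier outputs logits over {bad, good}.
        log_p_good = goodness_classifier(x, t).log_softmax(-1)[..., 1].sum()
        grad_good = torch.autograd.grad(log_p_good, x)[0]
    # The score model supplies the ability to produce typical samples;
    # the classifier gradient conditions the sampler on goodness.
    return score_model(x_t, t) + scale * grad_good
```

Note that the discriminator never generates anything on its own here; it only steers a sampler that already models the data distribution.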
The Crux?
When I say “breaks” the objective, I mean reward hacking/reward gaming/goodharting it. I’m surprised that this wasn’t obvious. To me, most/all of alignment difficulty falls out of extreme optimization power being aimed at objectives that aren’t actually what we want.
Ok, so part of the problem here is that we may be assuming different models for AGI. I am assuming a more brain-like pure ANN, which uses fully learned planning, more like a diffusion planning model (which is closer to what the brain probably uses), rather than the older, more commonly assumed approach of combining a learned world model and utility function with some explicit planning algorithm like MCTS or whatever.
So there are several different optimization layers that can be scaled:
The agent optimizing the world (can scale up planning horizon, etc)
Optimizing/training the learned world/action/planning model(s)
Optimizing/training the learned discriminator/utility model
You can scale these independently but only within limits, and it probably doesn’t make much sense to differentially scale them too far.
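As a rough illustration (not a claim about any particular system), here is a toy sketch, assuming PyTorch and hypothetical module names, of how these three layers can show up in a Diffuser-style learned planner: the trajectory denoiser plays the role of the world/action/planning model, the value model plays the role of the discriminator/utility model, and the agent-level knobs (horizon, denoising steps, guidance scale) scale the outer optimization.

```python
import torch

def plan(denoiser, value_model, state, horizon=32, steps=50, scale=0.1):
    # Layer 1 knobs: horizon, steps, scale control the agent's optimization power.
    traj = torch.randn(1, horizon, state.shape[-1])  # start from a noisy plan
    for t in reversed(range(steps)):
        traj = denoiser(traj, t)                     # layer 2: learned planning model
        with torch.enable_grad():
            tr = traj.detach().requires_grad_(True)
            grad = torch.autograd.grad(value_model(tr).sum(), tr)[0]
        traj = (traj + scale * grad).detach()        # layer 3: utility guidance
        traj[:, 0] = state                           # condition on the current state
    return traj
```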
But let’s just suppose for the moment that we completely abandon all competitiveness concerns and get rid of the learned discriminator entirely and replace it with the ground truth, an army of perfect human labellers.
I really wouldn’t call that the ground truth. The ground truth would be brain-sims (which is part of the rationale for brain-like AGI) combined with a complete, detailed understanding of the brain, and especially its utility/planning system equivalents. That being said, I am probably more optimistic about the “army of perfect human labellers” approach.
Telling whether a world state is good is nontrivial. You can be easily tricked into thinking a world state is good when it isn’t. If you ask the AI to go do something really difficult, you need to make it at least as hard to trick you with a Potemkin village as the task you want it to do.
Why is the AI generating Potemkin villages? Deceptive alignment? I’m assuming the use of proper sandbox sims to prevent deception. But I’m also independently optimistic about simpler, more automatable altruistic utility functions, such as maximization of human empowerment.
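For reference, and assuming the standard formalization is what’s meant here: “empowerment” (Klyubin, Polani, and Nehaniv) is the channel capacity from an agent’s next n actions to its resulting state,

```latex
% Empowerment at state s_t: the maximum mutual information between the
% agent's next n actions A^n and the resulting state S_{t+n}.
\mathfrak{E}(s_t) = \max_{p(a^n)} I\!\left(A^n ;\, S_{t+n} \mid s_t\right)
```

so a “maximize human empowerment” utility function would score plans by an estimate of this quantity for the affected humans.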
Telling whether a plan leads to a good world state is nontrivial. You don’t have a perfect world model. You can’t tell very reliably whether a proposed plan leads to good outcomes.
I don’t see why imperfect planning is more likely to lead to bad rather than good outcomes, all else being equal; and regardless, you don’t need anything near a perfect world model to match human intelligence. Furthermore, the assumption that the world model isn’t good enough to be useful for utility evaluations contradicts the assumption of superintelligence.
I believe this is fatally flawed for planning, partly because this means we can’t really achieve world states that are very weird from the current perspective (and I claim most good futures also seem weird from the current perspective), and also that because of imperfect world modelling the actual domain you end up regularizing over is the domain of plans, which means you can’t do things that are too different from things that have been done before.
The world/action/planning model does need to be retrained on its own rollouts, which will eventually cause it to learn to do things that are different and novel. Humans don’t seem to have much difficulty planning out weird future world states.
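A minimal sketch of that retraining loop, with `plan`, `env`, and `world_model` as hypothetical stand-ins:

```python
# Retrain the world/action/planning model on its own rollouts, so it
# gradually covers novel states it has only reached via planning.
def self_improve(world_model, env, plan, rounds: int = 10):
    data = []
    for _ in range(rounds):
        trajectory = plan(world_model, env.reset())  # plan with the current model
        data.append(env.rollout(trajectory))         # execute it; record what happened
        world_model.fit(data)                        # retrain on the growing rollout set
    return world_model
```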