Very interesting, thanks.
A few points on examples from humans in capacity-to-succeed-through-deception (“tricking” in the transcript):
It’s natural that we don’t observe anyone successfully doing this, since success entails not being identified as deceptive. This could involve secrecy, but more likely things like charisma and leverage of existing biases.
When making comparisons with very-smart-humans, I think it’s important to consider very-smart-across-all-mental-dimensions-humans (including charisma etc).
It may be that people have paths to high utility (which may entail happiness, enlightenment, meaning, contentment… rather than world domination) that don’t involve the risks of a deceptive strategy. If human utility were e.g. linear in material resources, things may look different.
Human deception is often kept in check by cost-of-punishments outweighing benefit-of-potential-success. With AI agents the space of meaningful punishments will likely look different.
Ah ok, if the honesty vow takes precedence. I still think it’s a difficult one in edge cases, but I don’t see effective resolutions that do better than using vows 2 and 3 to decide on those.
I’m not sure what’s the difference between “set of vows” and “policy”?
The point isn’t in choosing “set of vows” over “policy”, but rather in choosing “I make the set of vows...” over “Everything I do will be according to...”. You’re able to make the set of vows (albeit implicitly), and the vows themselves will have the optimal amount of wiggle-room, achievability, flexibility, emphasis on good faith… built in.
To say “Everything I do will be according to...” seems to set the bar unachievably high, since it just won’t be true. You can aim in that direction, but your actions won’t even usually be optimal w.r.t. that policy. (thoughts on trying-to-try notwithstanding, I do think vows that are taken seriously should at least be realistically possible to achieve)
To put it another way, to get the “Everything I do...” formulation to be equivalent to the “I make the set of vows...” formulation, I think the former would need to be self-referential—i.e. something like ”… according to the policy which is the KS solution… given its inclusion in this vow”. That self-reference will insert the optimal degree of wiggle-room etc.
I think you need either the extra indirection or the self-reference (or I’m confused, which is always possible :)).
It would explain at least a slight efficiency increase: presumably [some collection of factors] (SCoF) influences whether there’s a response or not. A priori you’d expect a smaller correlation of SCoF with SCoF-after-8-weeks than with SCoF-after-4-weeks.
Presumably the actual impact is larger than this would predict (at least without a better model of SCoF).
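The decay argument above can be illustrated with a toy simulation. Everything here is my own assumption, purely for illustration: I model SCoF as an AR(1) process, whose autocorrelation decays with lag, so a priori corr(SCoF, SCoF-after-8-weeks) < corr(SCoF, SCoF-after-4-weeks):

```python
# Toy illustration (my own assumption, not from the thread): model SCoF as
# an AR(1) process, so its autocorrelation decays with lag.
import random

random.seed(0)

def ar1_series(n, phi=0.8):
    """Generate an AR(1) series: x_t = phi * x_{t-1} + noise."""
    x, series = 0.0, []
    for _ in range(n):
        x = phi * x + random.gauss(0, 1)
        series.append(x)
    return series

def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

scof = ar1_series(20000)
corr_4 = correlation(scof[:-4], scof[4:])  # SCoF vs SCoF-after-4-weeks
corr_8 = correlation(scof[:-8], scof[8:])  # SCoF vs SCoF-after-8-weeks
print(corr_4 > corr_8)  # correlation decays with lag
```

Under this (admittedly crude) model, the lag-4 correlation exceeds the lag-8 one, which is all the "slight efficiency increase" argument needs.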
In any context where good faith isn’t to be expected (which I’d hope doesn’t apply here), bear in mind that there are exploits.
Agreed. It seems the right move should be to estimate the current net externalities (bearing in mind incentives to hide/publicise the negative/positive), and reward/punish in proportion to that.
Very interesting—and congratulations!
A few thoughts:

It strikes me that the first vow will sometimes conflict with the second. If your idea is that any conflict with the second vow would be a (mild) information hazard, then ok—but I’m not sure what the first vow adds in this case.
Have you considered going meta?

“I make the set of vows determined by the Kalai-Smorodinsky solution to the bargaining problem... I expect these vows to be something like [original vows go here], but the former description is definitive.”

This has the nice upside of automatically catching problems you haven’t considered, but not requiring you to be super-human. Specifically, the “Everything I do will be according to the policy...” clause just isn’t achievable. Committing to the set of vows such a policy would have you make is achievable (you might not follow them perfectly, but there’d automatically be a balance between achievability and other desiderata).
Having thought about it more (hopefully with more clarity), I think I have trouble imagining training data for f+ that:
We’re highly confident is correct.
Enables the model to decide which true things to output in general. (my (2) here)
It seems to me that we can be highly confident about matters of fact (how many chairs are in this room...), but less confident once value judgements come into play (which of A or B is the better answer to “How should I go about designing a chair?”).

[Of course it’s not black-and-white: one can make a philosophical argument that all questions are values questions. However, I think this is an issue even if we stick to pragmatic, common-sense approaches.]
I don’t think we can remedy this for values questions by including only data that we’re certain of. It seems to me that works for facts questions due to the structure of the world: it’s so hugely constrained by physical law that you can get an extremely good model by generalizing from sparse data from a different distribution.
It’s not clear that anything analogous works for generalizing preferences (maybe?? but I’d guess not). I’d expect an f+ trained on [data we’re highly confident is correct] to generalize poorly to general open questions.
Similarly, in Paul’s setup I think the following condition will fail if we need to be highly confident of the correctness (relative to what is known) of the small dataset:
The small dataset is still rich enough that you could infer correct language usage from it, i.e. the consistency condition on the small dataset alone suffices to recover all 10,000 bits required to specify the intended model.
It’s entirely plausible you can learn “correct language usage” in the narrow sense from consistency on the small dataset (i.e. you may infer a [deduced_statement → natural_language_equivalent] mapping). I don’t think it’s plausible you learn it in the sense required (i.e. a [(set_of_all_deduced_statements, Q) → natural_language_answer] mapping).
Again, perhaps I’m (not even) wrong, but I think the above accurately describes my current thinking.
Ok, the softer constraints make sense to me, thanks.
Using a debate with f+ assessing simple closed questions makes sense, but it seems to me that this only moves much of the problem rather than solving it. We start with “answering honestly vs predicting human answers” and end up with “judging honestly vs predicting human judgments”.
While “Which answer is better, Alice’s or Bob’s?” is a closed question, learning to answer the general case still requires applying a full model of human values—so it seems a judge-model is likely to be instrumental (or essentially equivalent: again, I’m not really sure what we’d mean by an intended model for the judge).
But perhaps I’m missing something here; is predicting-the-judge less of a problem than the original? Are there better approaches than using debate which wouldn’t have analogous issues?
Ok, I think that makes some sense in so far as you’re softening the f+=f− constraint and training it in more open-ended conditions. I’m not currently clear where this gets us, but I’ll say more about that in my response to Paul.
However, I don’t see how you can use generalization from the kind of dataset where f+ and f− always agree (having asked prescriptive questions). [EDIT: now I do, I was just thinking particularly badly]

I see honestly answering a question as a 2-step process (conceptually):
1) Decide which things are true.
2) Decide which true thing to output.
In the narrow case, we’re specifying ((2) | (1)) in the question, and training the model to do (1). Even if we learn a model that does (1) perfectly (in the intended way), it hasn’t learned anything that can generalize to (2).

Step (2) is in part a function of human values, so we’d need to be giving it some human-values training signal for it to generalize.
[EDIT: I’ve just realized that I’m being very foolish here. The above suggests that learning (1) doesn’t necessarily generalize to (2). In no way does it imply that it can’t. I think the point I want to make is that an f+ that does generalize extremely well in this way is likely to be doing some close equivalent to predicting-the-human. (in this I’m implicitly claiming that doing (2) well in general requires full understanding of human values)]
Overall, I’m still unsure how to describe what we want: clearly we don’t trust Alice’s answers if she’s being blackmailed, but how about if she’s afraid, mildly anxious, unusually optimistic, slightly distracted, thinking about concept a or b or c...?

It’s clear that the instrumental model just gives whatever response Alice would give here. I don’t know what the intended model should do; I don’t know what “honest answer” we’re looking for.
If the situation has property x, and Alice has reacted with unusual-for-Alice property y, do we want the Alice-with-y answer, or the standard-Alice answer? It seems to depend on whether we decide y is acceptable (or even required) w.r.t. answer reliability, given x. Then I think we get the same problem on that question, etc.
Thanks for writing this up. It is useful to see a non-Paul perspective on the same ideas, both in terms of clarifying the approach and eliminating a few of my confusions.

A typo: after “or defined in my notation as”, you have M+ twice rather than M+ M−.
I’ve not yet been through the details, but it’d be helpful if you’d clarify the starting point and scope a little, since I may well be misunderstanding you (and indeed Paul). In particular on this:
Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts.
My immediate thought is that in general question answering there is no unique honest unembedding. Much of answer formation is in deciding which information is most relevant, important, useful, tacitly assumed… (even assuming fixed world model and fixed logical deductions).

So I assume that you have to mean a narrower context where e.g. the question specifies the logical form the answer must take and the answering human/model assigns values to pre-defined variables.
For a narrower setting, the gist of the post makes sense to me—but I don’t currently see how a solution there would address the more general problem. Is finding a prior that works for closed questions with unique honest answers sufficient?
The more general setting seems difficult as soon as you’re asking open questions.

If you do apply the f+=f− constraint there, then it seems f+ must do hugely more than a simple unembedding from deductions. It’ll need to robustly select the same answer as a human from a huge set of honest answers, which seems to require something equivalent to predicting the human. At that point it’s not clear to me when exactly we’d want f+ to differ from f− in its later answers (there exist clear cases; I don’t see a good general rule, or how you’d form a robust dataset to learn a rule).

To put it another way, [honest output to q from fixed world model] doesn’t in general uniquely define an answer until you know what the answerer believes the asker of q values.
Apologies if I’m stating the obvious: I’m probably confused somewhere, and wish to double-check my ‘obvious’ assumptions. Clarifications welcome.
Ok, that all makes sense, thanks.
...and is-correct basically just tests whether they are equal.
So here “equal” would presumably be “essentially equal in the judgement of the complex process”, rather than verbatim equality of labels (the latter seems silly to me; if it’s not silly I must be missing something).
Just to check I’m understanding correctly, in step 2, do you imagine the complex labelling process deferring to the simple process iff the simple process is correct (according to the complex process)? Assuming that we require precise agreement, something of that kind seems necessary to me.
I.e. the labelling process would be doing something like this:
# Return a pair of (simple, complex) labels for a given input
simple_label = GenerateSimpleLabel(input)
if is_correct(simple_label, input):
    return simple_label, simple_label
else:
    return simple_label, GenerateComplexLabel(input)
Does that make sense?
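For concreteness, here is a runnable version of that logic. The labellers and the is_correct check are stand-in stubs of my own (the "simple" process just truncates to one word, the "complex" process is taken as ground truth), purely to make the deferral structure executable:

```python
# Runnable sketch of the labelling logic above. generate_simple_label,
# generate_complex_label and is_correct are illustrative stubs, not the
# actual processes under discussion.
def generate_simple_label(inp: str) -> str:
    return inp.split()[0]          # cheap heuristic labeller

def generate_complex_label(inp: str) -> str:
    return inp                     # expensive, trusted labeller

def is_correct(simple_label: str, inp: str) -> bool:
    # "essentially equal in the judgement of the complex process"
    return simple_label == generate_complex_label(inp)

def label_pair(inp: str):
    """Return (simple, complex) labels; complex defers to simple iff correct."""
    simple_label = generate_simple_label(inp)
    if is_correct(simple_label, inp):
        return simple_label, simple_label
    return simple_label, generate_complex_label(inp)

print(label_pair("chair"))          # ('chair', 'chair') — complex defers
print(label_pair("wooden chair"))   # ('wooden', 'wooden chair')
```

The second case shows the behaviour being asked about: the complex label only coincides with the simple one when the simple process is judged correct.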
A couple of typos:
“...we are only worried if the model [understands? knows?] the dynamics...”
“...don’t collect training data in situations without [where?] strong adversaries are trying...”
Thanks, that’s very helpful. It still feels to me like there’s a significant issue here, but I need to think more. At present I’m too confused to get much beyond handwaving. A few immediate thoughts (mainly for clarification; not sure anything here merits response):
I had been thinking too much lately of [isolated human] rather than [human process].
I agree the issue I want to point to isn’t precisely OOD generalisation. Rather it’s that the training data won’t be representative of the thing you’d like the system to learn: you want to convey X, and you actually convey [output of human process aiming to convey X]. I’m worried not about bias in the communication of X, but about properties of the generating process that can be inferred from the patterns of that bias.
It does seem hard to ensure you don’t end up OOD in a significant sense. E.g. if the content of a post-deployment question can sometimes be used to infer information about the questioner’s resource levels or motives.
The opportunity costs I was thinking about were in altruistic terms: where H has huge computational resources, or the questioner has huge resources to act in the world, [the most beneficial information H can provide] would often be better for the world than [good direct answer to the question]. More [persuasion by ML] than [extortion by ML].
If (part of) H would ever ideally like to use resources to output [beneficial information], but gives direct answers in order not to get thrown off the project, then (part of) H is deceptively aligned. Learning from a (partially) deceptively aligned process seems unsafe.
W.r.t. H’s making value calls, my worry isn’t that they’re asked to make value calls, but that every decision is an implicit value call (when you can respond with free text, at least).
I’m going to try writing up the core of my worry in more precise terms. It’s still very possible that any non-trivial substance evaporates under closer scrutiny.
I’d be interested in your thoughts on human motivation in HCH and amplification schemes. Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?
Specifically, it concerns me that every H will have preferences valued more highly than [completing whatever task we assign], so would be expected to optimise its output for its own values rather than the assigned task, where these objectives diverged. In general, output needn’t relate to the question/task.

[I don’t think you’ve addressed this at all recently—I’ve only come across specifying enlightened judgement precisely]
I’d appreciate it if you could say if/where you disagree with the following kind of argument. I’d like to know what I’m missing:
Motivation seems like an eventual issue for imitative amplification. Even for an H who always attempted to give good direct answers to questions in training, the best models at predicting H’s output would account for differing levels of enthusiasm, focus, effort, frustration… based in part on H’s attitude to the question and the opportunity cost in answering it directly.
The ‘correct’ (w.r.t. alignment preservation) generalisation must presumably be in all circumstances to give the output that H would give. In scenarios where H wouldn’t directly answer the question (e.g. because H believed the value of answering the question were trivial relative to opportunity cost), this might include deception, power-seeking etc. Usually I’d expect high value true-and-useful information unrelated to the question; deception-for-our-own-good just can’t be ruled out.
If a system doesn’t always adapt to give the output H would, on what basis do we trust it to adapt in ways we would endorse? It’s unclear to me how we avoid throwing the baby out with the bathwater here.
Or would you expect to find Hs for whom such scenarios wouldn’t occur? This seems unlikely to me: opportunity cost would scale with capability, and I’d predict every H would have their price (generally I’m more confident of this for precisely the kinds of H I’d want amplified: rational, altruistic...).
If we can’t find such Hs, doesn’t this at least present a problem for detecting training issues?: if HCH may avoid direct answers or deceive you (for worthy-according-to-H reasons), then an IDA of that H eventually would too. At that point you’d need to distinguish [benign non-question-related information] and [benevolent deception] from [malign obfuscation/deception], which seems hard (though perhaps no harder than achieving existing oversight desiderata???).
Even assuming that succeeds, you wouldn’t end up with a general-purpose question-answerer or task-solver: you’d get an agent that does whatever an amplified [model predicting H-diligently-answering-training-questions] thinks is best. This doesn’t seem competitive across enough contexts.
...but hopefully I’m missing something.
That’s a good point, though I do still think you need the right motivation. Where you’re convinced you’re right, it’s very easy to skim past passages that are ‘obviously’ incorrect, and fail to question assumptions.

(More generally, I do wonder what’s a good heuristic for this—clearly it’s not practical to constantly go back to first principles on everything; I’m not sure how to distinguish [this person is applying a poor heuristic] from [this person is applying a good heuristic to very different initial beliefs])
Perhaps the best would be a combination: a conversation which hopefully leaves you with the thought that you might be wrong, followed by the book to allow you to go into things on your own time without so much worry over losing face or winning.
Another point on the cause-for-optimism side is that being earnestly interested in knowing the truth is a big first step, and I think that description fits everyone mentioned so far.
I’d guess that reciprocal exchanges might work better for friends: I’ll read any m books you pick, so long as you read the n books I pick.
Less likely to get financial ick-factor, and it’s always possible that you’ll gain from reading the books they recommend.
Perhaps this could scale to public intellectuals where there’s either a feeling of trust or some verification mechanism (e.g. if the intellectual wants more people to read [some neglected X], and would willingly trade their time reading Y for a world where X were more widely appreciated).
Whether or not money is involved, I’m sceptical of the likely results for public intellectuals—or in general for people strongly attached to some viewpoint. The usual result seems to be a failure to engage with the relevant points. (perhaps not attacking head-on is the best approach: e.g. the asymmetrical weapons post might be a good place to start for Deutsch/Pinker)
Specifically, I’m thinking of David Deutsch speaking about AGI risk with Sam Harris: he just ends up telling a story where things go ok (or no worse than with humans), and the implicit argument is something like “I can imagine things going ok, and people have been incorrectly worried about things before, so this will probably be fine too”. Certainly Sam’s not the greatest technical advocate on the AGI risk side, but “I can imagine things going ok...” is a pretty general strategy.
The same goes for Steven Pinker, who spends nearly two hours with Stuart Russell on the FLI podcast, and seems to avoid actually thinking in favour of simply repeating the things he already believes. There’s quite a bit of [I can imagine things going ok...], [People have been wrong about downsides in the past...], and [here’s an argument against your trivial example], but no engagement with the more general points behind the trivial example.
Steven Pinker has more than enough intelligence to engage properly and re-think things, but he just pattern-matches any AI risk argument to [some scary argument that the future will be worse] and short-circuits to enlightenment-now cached thoughts. (to be fair to Steven, I imagine doing a book tour will tend to set related cached thoughts in stone, so this is a particularly hard case… but you’d hope someone who focuses on the way the brain works would realise this danger and adjust)
When you’re up against this kind of pattern-matching, I don’t think even the ideal book is likely to do much good. If two hours with Stuart Russell doesn’t work, it’s hard to see what would.
Unless I’ve confused myself badly (always possible!), I think either’s fine here. The | version just takes out a factor that’ll be common to all hypotheses: [p(e+) / p(e-)]. (since p(Tk & e+) ≡ p(Tk | e+) * p(e+))
Since we’ll renormalise, common factors don’t matter. Using the | version felt right to me at the time, but whatever allows clearer thinking is the way forward.
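A quick numeric check of the common-factor point (the numbers are my own, purely illustrative): since p(Tk & e+) = p(Tk | e+) * p(e+), the factor p(e+) is shared by every hypothesis, so after renormalising, the “&” and “|” versions give identical posteriors.

```python
# Numeric check: normalising p(Tk | e+) and normalising p(Tk & e+)
# give the same posterior, because p(e+) is a common factor.
p_e_plus = 0.7
p_Tk_given_e_plus = [0.1, 0.3, 0.6]                          # "|" version
p_Tk_and_e_plus = [p * p_e_plus for p in p_Tk_given_e_plus]  # "&" version

def normalise(ps):
    total = sum(ps)
    return [p / total for p in ps]

print([round(p, 6) for p in normalise(p_Tk_given_e_plus)])  # [0.1, 0.3, 0.6]
print([round(p, 6) for p in normalise(p_Tk_and_e_plus)])    # [0.1, 0.3, 0.6]
```

Any hypothesis-independent factor drops out in the same way, which is why either formulation is fine.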
Taking your last point first: I entirely agree on that. Most of my other points were based on the implicit assumption that readers of your post don’t think something like “It’s directly clear that 9 OOM will almost certainly be enough, by a similar argument”.
Certainly if they do conclude anything like that, then it’s going to massively drop their odds on 9-12 too. However, I’d still make an argument of a similar form: for some people, I expect that argument may well increase the 5-8 range more (than proportionately) than the 1-4 range.
On (1), I agree that the same goes for pretty-much any argument: that’s why it’s important. If you update without factoring in (some approximation of) your best judgement of the evidence’s impact on all hypotheses, you’re going to get the wrong answer. This will depend highly on your underlying model.
On the information content of the post, I’d say it’s something like “12 OOMs is probably enough (without things needing to scale surprisingly well)”. My credence for low OOM values is mostly based on worlds where things scale surprisingly well.
But this is a bit weird; my post didn’t talk about the <7 range at all, so why would it disproportionately rule out stuff in that range?
I don’t think this is weird. What matters isn’t what the post talks about directly—it’s the impact of the evidence provided on the various hypotheses. There’s nothing inherently weird about evidence increasing our credence in [TAI by +10OOM] and leaving our credence in [TAI by +3OOM] almost unaltered (quite plausibly because it’s not too relevant to the +3OOM case).
Compare the 1-2-3 coins example: learning y tells you nothing about the value of x. It’s only ruling out any part of the 1 outcome in the sense that it maintains [x_heads & something independent is heads], and rules out [x_heads & something independent is tails]. It doesn’t need to talk about x to do this.
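This can be verified by brute enumeration. Note the setup below is my reconstruction of the coins example (I'm assuming outcome 1 = [x heads & y heads], outcome 2 = [x heads & y tails], outcome 3 = [x tails]): learning y = heads rules out outcome 2 while leaving p(x = heads) untouched.

```python
# Enumerate my reconstruction of the 1-2-3 coins example (the exact setup
# is assumed, as labelled in the text above): conditioning on y = heads
# rules out outcome 2 but does not change p(x = heads).
from fractions import Fraction

half = Fraction(1, 2)
worlds = [(x, y, half * half) for x in "HT" for y in "HT"]  # (x, y, prob)

def p_x_heads(ws):
    total = sum(p for _, _, p in ws)
    return sum(p for x, _, p in ws if x == "H") / total

print(p_x_heads(worlds))                                   # 1/2 prior
posterior = [(x, y, p) for x, y, p in worlds if y == "H"]  # condition on y
print(p_x_heads(posterior))                                # still 1/2
```

The evidence "talks about" y only, yet it still rules out part of the outcome-1-adjacent space—exactly the sense in which evidence need not mention x to interact with outcomes involving x.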
You can do the same thing with the [TAI first at k OOM] case—call that Tk. Let’s say that your post is our evidence e, and that e+ stands for [e gives a compelling argument against T13+].

Updating on e+ you get the following for each k:
Initial hypotheses: [Tk & e+], [Tk & e-]
Final hypothesis: [Tk & e+]
So what ends up mattering is the ratio p[Tk | e+] : p[Tk | e-]. I’m claiming that this ratio is likely to vary with k.
Specifically, I’d expect T1 to be almost precisely independent of e+, while I’d expect T8 to be correlated. My reason on the T1 is that I think something radically unexpected would need to occur for T1 to hold, and your post just doesn’t seem to give any evidence for/against that.

I expect most people would change their T8 credence on seeing the post and accepting its arguments (if they’ve not thought similar things before). The direction would depend on whether they thought the post’s arguments could apply equally well to ~8 OOM as 12.
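To make this concrete, here is a small numeric sketch (all numbers are my own, purely illustrative) contrasting a proper Bayesian update, where the likelihood of e+ varies with k, against naive renormalisation that just drops T13+ and scales the rest uniformly:

```python
# Sketch of why the ratio p[Tk | e+] : p[Tk | e-] varying with k means you
# can't just zero out T13+ and renormalise. Here T1 is nearly independent
# of e+, while T8 is positively correlated with it.
priors = {"T1": 0.10, "T8": 0.40, "T13+": 0.50}
# Likelihood of observing e+ (a compelling argument against T13+) under each:
likelihoods = {"T1": 0.50, "T8": 0.80, "T13+": 0.10}

def posterior(priors, likelihoods):
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}

post = posterior(priors, likelihoods)
# Naive alternative: drop T13+ entirely and scale the rest uniformly.
naive = {k: p / (1 - priors["T13+"]) for k, p in priors.items() if k != "T13+"}
print({k: round(v, 3) for k, v in post.items()})
print({k: round(v, 3) for k, v in naive.items()})
```

Because T8 assigns e+ a higher likelihood than T1, the proper posterior shifts the T1 : T8 ratio (here from 1 : 4 to roughly 1 : 6.4), whereas naive renormalisation leaves it untouched.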
Note that I am assuming the argument ruling out 13+ OOM is as in the post (or similar). If it could take any form, then it could be a more or less direct argument for T1.
Overall, I’d expect most people who agree with the post’s argument to update along the following lines (but smoothly):
T0 to Ta: low increase in credence
Ta to Tb: higher increase in credence
Tb+: reduced credence
with something like (0 < a < 6) and (4 < b < 13). I’m pretty sure a is going to be non-zero for many people.
[[ETA: I’m not claiming the >12 OOM mass must all go somewhere other than the <4 OOM case: this was a hypothetical example for the sake of simplicity. I was saying that if I had such a model (with zwomples or the like), then a perfectly good update could leave me with the same posterior credence on <4 OOM.

In fact my credence on <4 OOM was increased, but only very slightly.]]
First I should clarify that the only point I’m really confident on here is the “In general, you can’t just throw out the >12 OOM and re-normalise, without further assumptions” argument.
I’m making a weak claim: we’re not in a position of complete ignorance w.r.t. the new evidence’s impact on alternate hypotheses.
My confidence in any specific approach is much weaker: I know little relevant data.
That said, I think the main adjustment I’d make to your description is to add the possibility for sublinear scaling of compute requirements with current techniques. E.g. if beyond some threshold meta-learning efficiency benefits are linear in compute, and non-meta-learned capabilities would otherwise scale linearly, then capabilities could scale with the square root of compute (feel free to replace with a less silly example of your own).
This doesn’t require “We’ll soon get more ideas”—just a version of “current methods scale” with unlucky (from the safety perspective) synergies.
So while the “current methods scale” hypothesis isn’t confined to 7-12 OOMs, the distribution does depend on how things scale: a higher proportion of the 1-6 region is composed of “current methods scale (very) sublinearly”.
My p(>12 OOM | sublinear scaling) was already low, so my p(1-6 OOM | sublinear scaling) doesn’t get much of a post-update boost (not much mass to re-assign).

My p(>12 OOM | (super)linear scaling) was higher, but my p(1-6 OOM | (super)linear scaling) was low, so there’s not too much of a boost there either (small proportion of mass assigned).
I do think it makes sense to end up with a post-update credence that’s somewhat higher than before for the 1-6 range—just not proportionately higher. I’m confident the right answer for the lower range lies somewhere between [just renormalise] and [don’t adjust at all], but I’m not at all sure where.
Perhaps there really is a strong argument that the post-update picture should look almost exactly like immediate renormalisation. My main point is that this does require an argument: I don’t think it’s a situation where we can claim complete ignorance over impact to other hypotheses (and so renormalise by default), and I don’t think there’s a good positive argument for [all hypotheses will be impacted evenly].