Lauro Langosco
[Question] What alignment-related concepts should be better known in the broader ML community?
Here are a few more questions about the same strategy:
If I understand correctly, the IG strategy is to learn a joint model for observations and actions $p_\theta(v, a \mid z)$, where $v$, $a$, and $z$ are video, actions, and proposed change to the Bayes net, respectively. Then we do inference using $z^*$, where $z^*$ is optimized for predictive usefulness.
This fails because there’s no easy way to get an answer to “is the diamond in the vault?” from $z^*$.
A simple way around this would be to learn $p_\theta(v, a, d \mid z)$ instead, where $d = 1$ if the diamond is in the vault and $d = 0$ otherwise.
Is my understanding correct?
If so, I would guess that my simple workaround doesn’t count as a strategy because we can only use this to predict whether the diamond is in the vault (or some other set of questions that must be fixed at training time), as opposed to any question we want an answer to. Is this correct? Is there some other reason this wouldn’t count, or does it in fact count?
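(For concreteness, here is the workaround I have in mind, written out; the split into a training step and a query step is my own gloss of the above, not something from the report:)

$$\theta^*, z^* \;\in\; \arg\max_{\theta,\, z}\; \mathbb{E}_{(v,a,d)\sim \mathcal{D}_{\text{train}}}\big[\log p_\theta(v, a, d \mid z)\big] \qquad \text{(fit on labelled training data)}$$
$$\text{answer}(v, a) \;=\; p_{\theta^*}(d = 1 \mid v, a, z^*) \qquad \text{(read off the diamond question at query time)}$$

The limitation I’m asking about is that $d$, and hence the set of questions we can answer this way, is fixed at training time.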
Proof: The only situation in which the iteration scheme does not update the decision boundary B is when we fail to find a predictor that does useful computation relative to E. *By hypothesis, the only way this can happen is if E does not contain all of E0 or E = C.* Since we start with E0 and only grow the easy set, it must be that E = C.
(emphasis mine)
To me it looks like the emphasized assumption (that it’s always possible to find a predictor that does useful computation) is the main source of your surprising result, as without it the iteration would not be possible.
That assumption strikes me as too strong; it is not realistic, since it requires that either information is created out of nowhere or that the easy set (plus maybe the training setup) contains all information about the full set. It also doesn’t seem necessary for a solution to ELK to satisfy this assumption (curious if you disagree?).
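To state the assumption I’m pointing at explicitly (using $E_0$, $E$, and $C$ for the initial easy set, the current easy set, and the full set, as in the quoted proof):

$$\text{for every } E \text{ with } E_0 \subseteq E \subsetneq C, \text{ there exists a predictor that does useful computation relative to } E.$$

The iteration needs this to hold at every step, which is why I think it carries most of the weight in the proof.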
Alignment researchers, how useful is extra compute for you?
Thanks for the rec! I knew TRC was awesome but wasn’t aware you could get that much compute.
Still, beyond short-term needs it seems like this is a risky strategy. TRC is basically a charity project that AFAIK could be shut down at any time.
Overall this updates me towards “we should very likely do the GCP funding thing. If this works out fine, setting up a shared cluster is much less urgent. A shared cluster still seems like the safer option in the mid to long term, if there is enough demand for it to be worthwhile”.
Curious if you disagree with any of this.
Hm, I still think it works? All oracles assume null outputs from all oracles, including themselves. Once a new oracle is built, it is considered part of this set of null-output oracles. (There are other hitches: e.g., Oracle1 will predict that Oracle2 will never be built, because why would humans build a machine that only ever gives null outputs? But this doesn’t help the oracles coordinate, as far as I can see.)
Ah yes, I missed that the oracle needs to be myopic, i.e. care only about the next prediction. I edited my definition of counterfactual oracle to include this (I think this is standard, as Stuart Armstrong’s paper also assumes myopia).
If it’s not myopic you’re right that it might help construct a misaligned system, or otherwise take over the world. I think that myopia is enough to prevent this though: If Oracle1 cares only about the current prediction, then there is no incentive for it to construct Oracle2, since Oracle2 can only help in future episodes.
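To make that precise (a sketch; the episode/reward notation is mine, not from Armstrong’s paper): a myopic Oracle1 at episode $t$ optimizes only the current-episode objective, while a non-myopic one optimizes a discounted sum over future episodes:

$$J_t^{\text{myopic}}(\pi) = \mathbb{E}_\pi[r_t], \qquad J_t^{\text{non-myopic}}(\pi) = \mathbb{E}_\pi\Big[\textstyle\sum_{k \ge t} \gamma^{\,k-t}\, r_k\Big].$$

Building Oracle2 can only change rewards $r_k$ with $k > t$, so it affects $J_t^{\text{non-myopic}}$ but leaves $J_t^{\text{myopic}}$ unchanged; a myopic Oracle1 therefore has no incentive to construct it.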
Some reasons why a predictor wants to be a consequentialist
This sounds like what Fix #2 is saying, which is meant to be addressed in the paragraph ‘Third Problem’.
To paraphrase that paragraph: the model that best predicts the data is likely to be a consequentialist. This is because consequentialists are general in a way that heuristic or other non-consequentialist systems aren’t, and generality is strongly selected for in domains that are very hard.
Curious if you disagree with anything in particular in that paragraph or what I just said.
Yeah this seems right! :) I am assuming no one ever inspects a partial output. This does seem risky, and it’s likely there are a bunch more possible failure modes here.
(Btw, thanks for this exchange; just wanted to note that it was valuable for me and made me notice some mistakes in how I was thinking about oracles)
I think that most nontrivial choices of loss function would give rise to consequentialist systems, including the ones you write down here.
In the post I was assuming offline training, that is (in your notation) the case where the distribution $D$ of the training data is unaffected by the model. This seems even more tame than the losses you write down, but it’s still dangerous, because an AGI can just figure out how to affect the data distribution ‘one-shot’, without having to trial-and-error learn how during training.
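To spell out the distinction (notation mine): offline training optimizes against a fixed distribution, whereas the worry about the model shaping its own data corresponds to an objective where the distribution depends on the model:

$$\min_\theta\; \mathbb{E}_{x \sim D}\,\big[\ell(x, \theta)\big] \;\;\text{(offline: $D$ fixed)} \qquad\text{vs.}\qquad \min_\theta\; \mathbb{E}_{x \sim D_\theta}\,\big[\ell(x, \theta)\big] \;\;\text{($D_\theta$ shaped by the model)}.$$

My claim is that even under the left-hand objective, a sufficiently capable model can work out at deployment how to influence the distribution it faces, without ever having been trained on the consequences of doing so.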
Hm, I don’t think your objection applies to what I’ve written? I don’t assume anything about using a loss like the one you describe. In the post I explicitly talk about offline training, where the data distribution is fixed.
Taking a guess at where the disagreement lies, I think it’s where you say
And $L - \min_{\theta^*} L(\theta^*)$ seems much more tame than L to me.
That loss does not in fact look ‘tame’ (by which I mean safe to optimize) to me. I’m happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.
You haven’t given any instrumental convergence theorem
I wish :) I’m not nearly as certain of anything I say in this post as I’d be of a theorem!
But since we’re not doing that, there’s nothing to counteract the negative gradient that removes the inner optimizer.
During training, the inner optimizer has the same behavior as the benign model: while it’s still dumb it just doesn’t know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.
So training does not select for a benign model over a consequentialist one (or at least it does not obviously select for a benign model; I don’t know how the inductive biases will work out here). Once the consequentialist acts and takes over the training process it is already too late.
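The gradient-level version of this point (a sketch, with $\hat{L}$ for the empirical training loss and $\theta_{\text{dec}}$, $\theta_{\text{ben}}$ labels I’m introducing here): if the deceptive consequentialist $\theta_{\text{dec}}$ and the benign model $\theta_{\text{ben}}$ behave identically on every training input, then

$$\hat{L}(\theta_{\text{dec}}) = \hat{L}(\theta_{\text{ben}}),$$

so the training loss cannot tell them apart and there is no “negative gradient that removes the inner optimizer”; which of the two you end up with is decided by inductive bias, not by the objective.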
Re: tameness of $L - \min_{\theta^*} L(\theta^*)$ (using a min since L is a loss), some things that come to mind are
a) $L - \min_{\theta^*} L(\theta^*)$ is never less than zero, so it can be minimized by a strategy that takes over the input channel and induces random noise, so that no strategy can do better than random and the objective is driven to zero (spelled out below).
b) Depending on which model class the min over $\theta^*$ is taken over, the model can push $L - \min_{\theta^*} L(\theta^*)$ below zero by hacking its environment to get more compute (thus escaping the model class in the min).
(probably this list can be extended)
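Spelling out (a): suppose the deployed model corrupts its input channel so that the target $y$ is pure noise, independent of the input, e.g. $y \sim \mathrm{Bernoulli}(1/2)$. Then no predictor can do better than an expected log loss of $\log 2$, and the deployed model itself achieves $\log 2$ by always outputting $1/2$, so

$$L(\theta) = \min_{\theta^*} L(\theta^*) = \log 2 \qquad\Longrightarrow\qquad L(\theta) - \min_{\theta^*} L(\theta^*) = 0,$$

i.e. the objective sits at its global minimum even though the model got there by taking over its input channel.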
Minor comment on clarity: you don’t explicitly define relaxed adversarial training (it’s only mentioned in the title and the conclusion), which is a bit confusing for someone coming across the term for the first time. Since this is the current reference post for RAT, I think it would be nice to define it explicitly; for example, I’d suggest renaming the second section to ‘Formalizing relaxed adversarial training’, and within the section calling it that instead of ‘Paul’s approach’.
(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)
The basics
It doesn’t prevent you from shutting it down
It doesn’t prevent you from modifying it
It doesn’t deceive or manipulate you
It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all
Myopia
Its objective is no more broad or long-term than is required to complete the task
In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
It does not cooperate (in the sense of helping them achieve their objectives) with future, past, or (duplicate) concurrent versions of itself, unless intended by the operator
Non-maximizing
It doesn’t maximize the probability of getting the task done; it just does something that gets the task done with (say) >99% probability (a toy sketch of what I mean follows at the end of this section)
It doesn’t “optimize too hard” (not sure how to state this better)
Example: when communicating with humans (e.g. to query them about their instructions), it does not maximize communication bandwidth / information transfer; it just communicates reasonably well
Its objective / task does not consist in maximizing any quantity; rather, it follows a specific bounded instruction (like “make me a coffee”, or “tell me a likely outcome of this plan”) and then shuts down
It doesn’t optimize over causal pathways you don’t want it to: for example, if it is meant to predict the consequences of a plan, it does not try to make its prediction more likely to happen
It does not try to become more consequentialist with respect to its goals
for example, if in the middle of deployment the system reads a probability theory textbook, learns about dutch book theorems, and decides that EV maximization is the best way to achieve its goals, it will not change its behavior
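To illustrate the “>99% probability” bullet above, here’s a toy sketch of satisficing plan selection; the plan generator and the success-probability estimator are stand-ins, not a concrete proposal:

```python
import random

def choose_plan(candidate_plans, estimate_success_prob, threshold=0.99):
    """Pick *some* plan that clears the success threshold, rather than
    the plan that maximizes success probability."""
    good_enough = [plan for plan in candidate_plans
                   if estimate_success_prob(plan) >= threshold]
    if not good_enough:
        # No acceptable plan: defer to the operators instead of trying harder.
        raise RuntimeError("No plan clears the threshold; ask the operators.")
    # Deliberately not argmax: any plan above the threshold is acceptable.
    return random.choice(good_enough)
```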
No weird stuff
It doesn’t try to acausally cooperate or trade with far-away possible AIs
It doesn’t come to believe that it is being simulated by multiverse-aliens trying to manipulate the universal prior (or whatever)
It doesn’t attempt to simulate a misaligned intelligence
In fact it doesn’t simulate any other intelligences at all, except to the minimal degree of fidelity that is required to perform the task
Human imitation
Where possible, it should imitate a human that is trying to be corrigible
To the extent that this is possible while completing the task, it should try to act like a helpful human would (but not unboundedly minimizing the distance in behavior-space)
When this is not possible (e.g. because it is executing strategies that a human could not), it should stay close to human-extrapolated behavior (“what would a corrigible, unusually smart / competent / knowledgeable human do?”)
To the extent that meta-cognition is necessary, it should think about itself and corrigibility in the same way its operators do: its objectives are likely misspecified, therefore it should not become too consequentialist, or “optimize too hard”, and [other corrigibility desiderata]
Querying / robustness
Insofar as this is feasible it presents its plans to humans for approval, including estimates of the consequences of its plans
It will raise an exception, i.e. pause execution of its plans and notify its operators if
its instructions are unclear
it recognizes a flaw in its design
it sees a way in which corrigibility could be strengthened
in the course of performing its task, the ability of its operators to shut it down or modify it would be limited
in the course of performing its task, its operators would predictably be deceived / misled about the state of the world
What would make you change your mind about robustness of behavior (or interpretability of internal representations) through the sharp left turn? Or about the existence of such a sharp left turn, as opposed to smooth scaling of ability to learn in-context?
For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?
But in my report I arrive at a forecast by fixing a model size based on estimates of brain computation, and then using scaling laws to estimate how much data is required to train a model of that size. The update from Chinchilla is then that we need more data than I might have thought.
I’m confused by this argument. The old GPT-3 scaling law is still correct, just not compute-optimal. If someone wanted to, they could still go on using the old scaling law. So discovering better scaling can only lead to an update towards shorter timelines, right?
(Except if you had expected even better scaling laws by now, but it didn’t sound like that was your argument?)
Gotcha, this makes sense to me now, given the assumption that to get AGI we need to train a P-parameter model with the compute-optimal amount of data, where P is fixed. Thanks!
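(Spelling that assumption out with rough numbers, using the standard approximations $C \approx 6PD$ for training compute and, post-Chinchilla, $D \approx 20P$ at the compute-optimal point; the particular $P$ is only for illustration: for a brain-anchor-ish $P = 10^{14}$ parameters, Chinchilla-optimal training needs $D \approx 2 \times 10^{15}$ tokens and so $C \approx 6PD \approx 10^{30}$ FLOP, roughly an order of magnitude more than one gets from GPT-3-style token-to-parameter ratios of $\sim 2$ tokens per parameter. So at fixed $P$ the data and compute requirements go up, which is the update towards longer timelines; whereas at a fixed compute budget the Chinchilla allocation only ever gets you a more capable model, which is the shorter-timelines reading I had in mind.)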
...though now I’m confused about why we would assume that. Surely that assumption is wrong?
Humans are very constrained in terms of brain size and data, so we shouldn’t assume that these quantities are scaled optimally in some sense that generalizes to deep learning models.
Anyhow we don’t need to guess the amount of data the human brain needs: we can just estimate it directly, just like we estimate brain-parameter count.
To move to a more general complaint about the bio anchors paradigm: it never made much sense to assume that current scaling laws would hold; clearly scaling will change once we train on new data modalities; we know that human brains have totally different scaling laws than DL models; and an AGI architecture will again have different scaling laws. Going with the GPT-3 scaling law is a very shaky best guess.
So it seems weird to me to put so much weight on this particular estimate, such that someone figuring out how to scale models much more cheaply would update one in the direction of longer timelines! Surely the bio anchor assumptions cannot possibly be strong enough to outweigh the commonsense update of ‘whoa, we can scale much more quickly now’?
The only way that update makes sense is if you actually rely mostly on bio anchors to estimate timelines (rather than taking bio anchors to be a loose prior, and update off the current state and rate of progress in ML), which seems very wrong to me.
Thanks!
Would you consider this a valid counter to the third strategy (have humans adopt the optimal Bayes net using imitative generalization), as an alternative to ontology mismatch?