Lauro Langosco
[Question] What alignment-related concepts should be better known in the broader ML community?
Here are a few more questions about the same strategy:
If I understand correctly, the IG strategy is to learn a joint model for observations and actions $p_\theta(v, a \mid z)$, where $v$, $a$, and $z$ are video, actions, and proposed change to the Bayes net, respectively. Then we do inference using $z^*$, where $z^*$ is optimized for predictive usefulness.
This fails because there’s no easy way to get an answer to “is the diamond in the vault?” from $z^*$.
A simple way around this would be to learn $p_\theta(v, a, d \mid z)$ instead, where $d = 1$ if the diamond is in the vault and $d = 0$ otherwise.
Is my understanding correct?
If so, I would guess that my simple workaround doesn’t count as a strategy because we can only use this to predict whether the diamond is in the vault (or some other set of questions that must be fixed at training time), as opposed to any question we want an answer to. Is this correct? Is there some other reason this wouldn’t count, or does it in fact count?
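(For concreteness, here is the workaround I have in mind, written out; the split into a training step and a query step is my own gloss of the above, not something from the report:)

$$\theta^*, z^* \;\in\; \arg\max_{\theta,\, z}\; \mathbb{E}_{(v,a,d)\sim \mathcal{D}_{\text{train}}}\big[\log p_\theta(v, a, d \mid z)\big] \qquad \text{(fit on labelled training data)}$$
$$\text{answer}(v, a) \;=\; p_{\theta^*}(d = 1 \mid v, a, z^*) \qquad \text{(read off the diamond question at query time)}$$

The limitation I’m asking about is that $d$, and hence the set of questions we can answer this way, is fixed at training time.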
Proof: The only situation in which the iteration scheme does not update the decision boundary B is when we fail to find a predictor that does useful computation relative to E. *By hypothesis, the only way this can happen is if E does not contain all of E0 or E = C.* Since we start with E0 and only grow the easy set, it must be that E = C.
(emphasis mine)
To me it looks like the emphasized assumption (that it’s always possible to find a predictor that does useful computation) is the main source of your surprising result, as without it the iteration would not be possible.
That assumption strikes me as too strong; it is not realistic, since it requires that either information is created out of nowhere or that the easy set (plus maybe the training setup) contains all information about the full set. It also doesn’t seem necessary for a solution to ELK to satisfy this assumption (curious if you disagree?).
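To state the assumption I’m pointing at explicitly (using $E_0$, $E$, and $C$ for the initial easy set, the current easy set, and the full set, as in the quoted proof):

$$\text{for every } E \text{ with } E_0 \subseteq E \subsetneq C, \text{ there exists a predictor that does useful computation relative to } E.$$

The iteration needs this to hold at every step, which is why I think it carries most of the weight in the proof.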
Alignment researchers, how useful is extra compute for you?
Thanks for the rec! I knew TRC was awesome but wasn’t aware you could get that much compute.
Still, beyond short-term needs it seems like this is a risky strategy. TRC is basically a charity project that AFAIK could be shut down at any time.
Overall this updates me towards “we should very likely do the GCP funding thing. If this works out fine, setting up a shared cluster is much less urgent. A shared cluster still seems like the safer option in the mid to long term, if there is enough demand for it to be worthwhile”.
Curious if you disagree with any of this.
Hm, I still think it works? All oracles assume null outputs from all oracles, including themselves. Once a new oracle is built, it is considered part of this set of null-output oracles. (There are other hitches: e.g., Oracle1 will predict that Oracle2 will never be built, because why would humans build a machine that only ever gives null outputs? But this doesn’t help the oracles coordinate, as far as I can see.)
Ah yes, I missed that the oracle needs to be myopic, i.e. care only about the next prediction. I edited my definition of counterfactual oracle to include this (I think this is standard, as Stuart Armstrong’s paper also assumes myopia).
If it’s not myopic you’re right that it might help construct a misaligned system, or otherwise take over the world. I think that myopia is enough to prevent this though: If Oracle1 cares only about the current prediction, then there is no incentive for it to construct Oracle2, since Oracle2 can only help in future episodes.
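To make that precise (a sketch; the episode/reward notation is mine, not from Armstrong’s paper): a myopic Oracle1 at episode $t$ optimizes only the current-episode objective, while a non-myopic one optimizes a discounted sum over future episodes:

$$J_t^{\text{myopic}}(\pi) = \mathbb{E}_\pi[r_t], \qquad J_t^{\text{non-myopic}}(\pi) = \mathbb{E}_\pi\Big[\textstyle\sum_{k \ge t} \gamma^{\,k-t}\, r_k\Big].$$

Building Oracle2 can only change rewards $r_k$ with $k > t$, so it affects $J_t^{\text{non-myopic}}$ but leaves $J_t^{\text{myopic}}$ unchanged; a myopic Oracle1 therefore has no incentive to construct it.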
Some reasons why a predictor wants to be a consequentialist
This sounds like what Fix #2 is saying, which is meant to be addressed in the paragraph ‘Third Problem’.
To paraphrase that paragraph: the model that best predicts the data is likely to be a consequentialist. This is because consequentialists are general in a way that heuristic or other non-consequentialist systems aren’t, and generality is strongly selected for in domains that are very hard.
Curious if you disagree with anything in particular in that paragraph or what I just said.
Yeah this seems right! :) I am assuming no one ever inspects a partial output. This does seem risky, and it’s likely there are a bunch more possible failure modes here.
(Btw, thanks for this exchange; just wanted to note that it was valuable for me and made me notice some mistakes in how I was thinking about oracles)
I think that most nontrivial choices of loss function would give rise to consequentialist systems, including the ones you write down here.
In the post I was assuming offline training, that is (in your notation) the case where the distribution $D$ of the training data is unaffected by the model. This seems even more tame than the losses you write down, but it’s still dangerous, because an AGI can just figure out how to affect the data distribution ‘one-shot’, without having to trial-and-error learn how during training.
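To spell out the distinction (notation mine): offline training optimizes against a fixed distribution, whereas the worry about the model shaping its own data corresponds to an objective where the distribution depends on the model:

$$\min_\theta\; \mathbb{E}_{x \sim D}\,\big[\ell(x, \theta)\big] \;\;\text{(offline: $D$ fixed)} \qquad\text{vs.}\qquad \min_\theta\; \mathbb{E}_{x \sim D_\theta}\,\big[\ell(x, \theta)\big] \;\;\text{($D_\theta$ shaped by the model)}.$$

My claim is that even under the left-hand objective, a sufficiently capable model can work out at deployment how to influence the distribution it faces, without ever having been trained on the consequences of doing so.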
Hm, I don’t think your objection applies to what I’ve written? I don’t assume anything about using a loss like the one you describe. In the post I explicitly talk about offline training, where the data distribution is fixed.
Taking a guess at where the disagreement lies, I think it’s where you say
And $L - \min_{\theta^*} L(\theta^*)$ seems much more tame than L to me.
That loss does not in fact look ‘tame’ (by which I mean safe to optimize) to me. I’m happy to explain why, but without seeing your reasoning behind the quoted statement I can only rehash the things I say in the post.
You haven’t given any instrumental convergence theorem
I wish :) I’m not nearly as certain of anything I say in this post as I’d be of a theorem!
But since we’re not doing that, there’s nothing to counteract the negative gradient that removes the inner optimizer.
During training, the inner optimizer has the same behavior as the benign model: while it’s still dumb it just doesn’t know how to do better; when it becomes smarter and reaches strategic awareness it will be deceptive.
So training does not select for a benign model over a consequentialist one (or at least it does not obviously select for a benign model; I don’t know how the inductive biases will work out here). Once the consequentialist acts and takes over the training process it is already too late.
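The gradient-level version of this point (a sketch, with $\hat{L}$ for the empirical training loss and $\theta_{\text{dec}}$, $\theta_{\text{ben}}$ labels I’m introducing here): if the deceptive consequentialist $\theta_{\text{dec}}$ and the benign model $\theta_{\text{ben}}$ behave identically on every training input, then

$$\hat{L}(\theta_{\text{dec}}) = \hat{L}(\theta_{\text{ben}}),$$

so the training loss cannot tell them apart and there is no “negative gradient that removes the inner optimizer”; which of the two you end up with is decided by inductive bias, not by the objective.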
Re: tameness of $L - \min_{\theta^*} L(\theta^*)$ (using a min since L is a loss), some things that come to mind are
a) $L - \min_{\theta^*} L(\theta^*)$ is never less than zero, so it can be minimized by a strategy that takes over the input channel and induces random noise, so that no strategy can do better than random and the objective is driven to zero (spelled out below).
b) Depending on which model class the min over $\theta^*$ is taken over, the model can push $L - \min_{\theta^*} L(\theta^*)$ below zero by hacking its environment to get more compute (thus escaping the model class in the min).
(probably this list can be extended)
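Spelling out (a): suppose the deployed model corrupts its input channel so that the target $y$ is pure noise, independent of the input, e.g. $y \sim \mathrm{Bernoulli}(1/2)$. Then no predictor can do better than an expected log loss of $\log 2$, and the deployed model itself achieves $\log 2$ by always outputting $1/2$, so

$$L(\theta) = \min_{\theta^*} L(\theta^*) = \log 2 \qquad\Longrightarrow\qquad L(\theta) - \min_{\theta^*} L(\theta^*) = 0,$$

i.e. the objective sits at its global minimum even though the model got there by taking over its input channel.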
Minor comment on clarity: you don’t explicitly define relaxed adversarial training (it’s only mentioned in the title and the conclusion), which is a bit confusing for someone coming across the term for the first time. Since this is the current reference post for RAT, I think it would be nice to define it explicitly; for example, I’d suggest renaming the second section to ‘Formalizing relaxed adversarial training’, and within the section calling it that instead of ‘Paul’s approach’.
(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)
The basics
It doesn’t prevent you from shutting it down
It doesn’t prevent you from modifying it
It doesn’t deceive or manipulate you
It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all
Myopia
Its objective is no more broad or long-term than is required to complete the task
In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
It does not cooperate (in the sense of helping them achieve their objectives) with future, past, or (duplicate) concurrent versions of itself, unless intended by the operator
Non-maximizing
It doesn’t maximize the probability of getting the task done; it just does something that gets the task done with (say) >99% probability (a toy sketch of what I mean follows at the end of this section)
It doesn’t “optimize too hard” (not sure how to state this better)
Example: when communicating with humans (e.g. to query them about their instructions), it does not maximize communication bandwidth / information transfer; it just communicates reasonably well
Its objective / task does not consist in maximizing any quantity; rather, it follows a specific bounded instruction (like “make me a coffee”, or “tell me a likely outcome of this plan”) and then shuts down
It doesn’t optimize over causal pathways you don’t want it to: for example, if it is meant to predict the consequences of a plan, it does not try to make its prediction more likely to happen
It does not try to become more consequentialist with respect to its goals
for example, if in the middle of deployment the system reads a probability theory textbook, learns about dutch book theorems, and decides that EV maximization is the best way to achieve its goals, it will not change its behavior
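To illustrate the “>99% probability” bullet above, here’s a toy sketch of satisficing plan selection; the plan generator and the success-probability estimator are stand-ins, not a concrete proposal:

```python
import random

def choose_plan(candidate_plans, estimate_success_prob, threshold=0.99):
    """Pick *some* plan that clears the success threshold, rather than
    the plan that maximizes success probability."""
    good_enough = [plan for plan in candidate_plans
                   if estimate_success_prob(plan) >= threshold]
    if not good_enough:
        # No acceptable plan: defer to the operators instead of trying harder.
        raise RuntimeError("No plan clears the threshold; ask the operators.")
    # Deliberately not argmax: any plan above the threshold is acceptable.
    return random.choice(good_enough)
```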
No weird stuff
It doesn’t try to acausally cooperate or trade with far-away possible AIs
It doesn’t come to believe that it is being simulated by multiverse-aliens trying to manipulate the universal prior (or whatever)
It doesn’t attempt to simulate a misaligned intelligence
In fact it doesn’t simulate any other intelligences at all, except to the minimal degree of fidelity that is required to perform the task
Human imitation
Where possible, it should imitate a human that is trying to be corrigible
To the extent that this is possible while completing the task, it should try to act like a helpful human would (but not unboundedly minimizing the distance in behavior-space)
When this is not possible (e.g. because it is executing strategies that a human could not), it should stay close to human-extrapolated behavior (“what would a corrigible, unusually smart / competent / knowledgeable human do?”)
To the extent that meta-cognition is necessary, it should think about itself and corrigibility in the same way its operators do: its objectives are likely misspecified, therefore it should not become too consequentialist, or “optimize too hard”, and [other corrigibility desiderata]
Querying / robustness
Insofar as this is feasible it presents its plans to humans for approval, including estimates of the consequences of its plans
It will raise an exception, i.e. pause execution of its plans and notify its operators if
its instructions are unclear
it recognizes a flaw in its design
it sees a way in which corrigibility could be strengthened
in the course of performing its task, the ability of its operators to shut it down or modify it would be limited
in the course of performing its task, its operators would predictably be deceived / misled about the state of the world
What would make you change your mind about robustness of behavior (or interpretability of internal representations) through the sharp left turn? Or about the existence of such a sharp left turn, as opposed to smooth scaling of ability to learn in-context?
For example, would you change your mind if we found smooth scaling laws for (some good measure of) in-context learning?
But in my report I arrive at a forecast by fixing a model size based on estimates of brain computation, and then using scaling laws to estimate how much data is required to train a model of that size. The update from Chinchilla is then that we need more data than I might have thought.
I’m confused by this argument. The old GPT-3 scaling law is still correct, just not compute-optimal. If someone wanted to, they could still go on using the old scaling law. So discovering better scaling can only lead to an update towards shorter timelines, right?
(Except if you had expected even better scaling laws by now, but it didn’t sound like that was your argument?)
Gotcha, this makes sense to me now, given the assumption that to get AGI we need to train a P-parameter model with the compute-optimal amount of data, where P is fixed. Thanks!
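(Spelling that assumption out with rough numbers, using the standard approximations $C \approx 6PD$ for training compute and, post-Chinchilla, $D \approx 20P$ at the compute-optimal point; the particular $P$ is only for illustration: for a brain-anchor-ish $P = 10^{14}$ parameters, Chinchilla-optimal training needs $D \approx 2 \times 10^{15}$ tokens and so $C \approx 6PD \approx 10^{30}$ FLOP, roughly an order of magnitude more than one gets from GPT-3-style token-to-parameter ratios of $\sim 2$ tokens per parameter. So at fixed $P$ the data and compute requirements go up, which is the update towards longer timelines; whereas at a fixed compute budget the Chinchilla allocation only ever gets you a more capable model, which is the shorter-timelines reading I had in mind.)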
...though now I’m confused about why we would assume that. Surely that assumption is wrong?
Humans are very constrained in terms of brain size and data, so we shouldn’t assume that these quantities are scaled optimally in some sense that generalizes to deep learning models.
Anyhow we don’t need to guess the amount of data the human brain needs: we can just estimate it directly, just like we estimate brain-parameter count.
To move to a more general complaint about the bio anchors paradigm: it never made much sense to assume that current scaling laws would hold; clearly scaling will change once we train on new data modalities; we know that human brains have totally different scaling laws than DL models; and an AGI architecture will again have different scaling laws. Going with the GPT-3 scaling law is a very shaky best guess.
So it seems weird to me to put so much weight on this particular estimate, such that someone figuring out how to scale models much more cheaply would update one in the direction of longer timelines! Surely the bio anchor assumptions cannot possibly be strong enough to outweigh the commonsense update of ‘whoa, we can scale much more quickly now’?
The only way that update makes sense is if you actually rely mostly on bio anchors to estimate timelines (rather than taking bio anchors to be a loose prior, and update off the current state and rate of progress in ML), which seems very wrong to me.
Thanks!
Would you consider this a valid counter to the third strategy (have humans adopt the optimal Bayes net using imitative generalization), as an alternative to ontology mismatch?