dxu (David Xu)

Karma: 4,550

dxu 12 Jun 2024 9:16 UTC
2 points
0
in reply to: Ebenezer Dukakis’s comment on: My AI Model Delta Compared To Yudkowsky

I think it ought to be possible for someone to always be present. [I’m also not sure it would be necessary.]

I think I don’t understand what you’re imagining here. Are you imagining a human manually overseeing all outputs of something like ChatGPT, or Microsoft Copilot, before those outputs are sent to the end user (or, worse yet, put directly into production)?

[I also think I don’t understand why you make the bracketed claim you do, but perhaps hashing that out isn’t a conversational priority.]

As I understand this thought experiment, we’re doing next-token prediction on e.g. a book written by a philosopher, and in order to predict the next token using QFT, the obvious method is to use QFT to simulate the philosopher. But that’s not quite enough—you also need to read the next token out of that QFT-based simulation if you actually want to predict it.

It sounds like your understanding of the thought experiment differs from mine. If I were to guess, I’d guess that by “you” you’re referring to someone or something outside of the model, who has access to the model’s internals, and who uses that access to, as you say, “read” the next token out of the model’s ontology. However, this is not the setup we’re in with respect to actual models (with the exception perhaps of some fairly limited experiments in mechanistic interpretability)—and it’s also not the setup of the thought experiment, which (after all) is about precisely what happens when you can’t read things out of the model’s internal ontology, because it’s too alien to be interpreted.

In other words: “you” don’t read the next token out of the QFT simulation. The model is responsible for doing that translation work. How do we get it to do that, even though we don’t know how to specify the nature of the translation work, much less do it ourselves? Well, simple: in cases where we have access to the ground truth of the next token, e.g. because we’re having it predict an existing book passage, we simply penalize it whenever its output fails to match the next token in the book. In this way, the model can be incentivized to correctly predict whatever we want it to predict, even if we wouldn’t know how to tell it explicitly to do whatever it’s doing.
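To make the penalty concrete, here is a minimal sketch of grading a prediction against a known next token. Cross-entropy is assumed as the penalty because it is the usual choice for next-token prediction; the toy vocabulary and the two hypothetical output distributions are invented for illustration and are not part of the thought experiment.

```python
import math

def next_token_loss(predicted_probs, true_token):
    """Cross-entropy penalty: large when the true token got low probability."""
    return -math.log(predicted_probs[true_token])

# Hypothetical model outputs: probability distributions over a toy vocabulary.
confident_right = {"the": 0.9, "a": 0.05, "cat": 0.05}
confident_wrong = {"the": 0.05, "a": 0.9, "cat": 0.05}

true_next = "the"  # ground truth, read from the existing book passage
loss_right = next_token_loss(confident_right, true_next)
loss_wrong = next_token_loss(confident_wrong, true_next)

print(f"loss when right: {loss_right:.3f}, loss when wrong: {loss_wrong:.3f}")
```

Note that the loss only looks at the output distribution. Nothing in it inspects, or constrains, how the model arrived at that distribution internally.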

(The nature of this relationship—whereby humans train opaque algorithms to do things they wouldn’t themselves be able to write out as pseudocode—is arguably the essence of modern deep learning in toto.)

For one thing, in a standard train/dev/test setup, the model is arguably not really doing prediction, it’s doing retrodiction. It’s making ‘predictions’ about things which already happened in the past. The final model is chosen based on what retrodicts the data the best.

Yes, this is a reasonable description to my eyes. Moreover, I actually think it maps fairly well to the above description of how a QFT-style model might be trained to predict the next token of some body of text; in your terms, this is possible specifically because the text already exists, and retrodictions of that text can be graded based on how well they compare against the ground truth.

Also, usually the data is IID rather than sequential—there’s no time component to the data points (unless it’s a time-series problem, which it usually isn’t).

This, on the other hand, doesn’t sound right to me. Yes, there are certainly applications where the training regime produces IID data, but next-token prediction is pretty clearly not one of those? Later tokens are highly conditionally dependent on previous tokens, in a way that’s much closer to a time series than to some kind of IID process. Possibly part of the disconnect is that we’re imagining different applications entirely—which might also explain our differing intuitions w.r.t. deployment?
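As a rough illustration of that conditional dependence, the sketch below compares the marginal frequency of a character with its frequency given the preceding character. The corpus is a toy string chosen for the purpose, and characters stand in for tokens.

```python
from collections import Counter

text = "quick queen quits quiz"  # toy corpus, chosen for illustration

# Marginal frequency of 'u' across all character positions.
marginal_u = text.count("u") / len(text)

# Conditional frequency of 'u' given that the previous character was 'q'.
after_q = Counter(nxt for prev, nxt in zip(text, text[1:]) if prev == "q")
cond_u_given_q = after_q["u"] / sum(after_q.values())

print(f"P(u) = {marginal_u:.2f}, P(u | previous = q) = {cond_u_given_q:.2f}")
```

If the positions were IID, the two numbers would be roughly equal; here they are far apart.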

The fact that we’re choosing a model which retrodicts well is why the presence/absence of a human is generally assumed to be irrelevant, and emphasizing this factor sounds wacky to my ML engineer ears.

Right, so just to check that we’re on the same page: do we agree that after a (retrodictively trained) model is deployed for some use case other than retrodicting existing data—for generative use, say, or for use in some kind of online RL setup—then it’ll be doing something other than retrodicting? And that in that situation, the source of (retrodictable) ground truth that was present during training—whether that was a book, a philosopher, or something else—will be absent?

If we do actually agree about that, then that distinction is really all I’m referring to! You can think of it as training set versus test set, to use a more standard ML analogy, except in this case the “test set” isn’t labeled at all, because no one labeled it in advance, and also it’s coming in from an unpredictable outside world rather than from a folder on someone’s hard drive.

Why does that matter? Well, because then we’re essentially at the mercy of the model’s generalization properties, in a way we weren’t while it was retrodicting the training set (or even the validation set, if one of those existed). If it gets anything wrong, there’s no longer any training signal or gradient to penalize it for being “wrong”—so the only remaining question is, just how likely is it to be “wrong”, after being trained for however long it was trained?
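A toy sketch of what “at the mercy of generalization” can look like: two rules that are indistinguishable on every graded input and far apart on an ungraded one. Both functions and all the numbers are hypothetical, picked only so that the agreement on the training inputs is exact.

```python
# Inputs seen during training, where ground truth was available for grading.
train_inputs = [0, 1, 2, 3]

def intended(x):
    """What we hoped the model would learn (hypothetical target rule)."""
    return x

def proxy(x):
    """A different rule that matches `intended` on every training input."""
    return x + x * (x - 1) * (x - 2) * (x - 3)

# Zero error everywhere a training signal existed.
train_errors = [abs(intended(x) - proxy(x)) for x in train_inputs]

deployed_input = 10  # never graded, so nothing penalizes being "wrong" here
gap = abs(intended(deployed_input) - proxy(deployed_input))

print(train_errors, gap)
```

No amount of additional grading on the training inputs distinguishes the two rules; only the learner's inductive biases decide which one it ends up implementing.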

And that’s where the QFT model comes in. It says, actually, even if you train me for a good long while on a good amount of data, there are lots of ways for me to generalize “wrongly” from your perspective, if I’m modeling the universe at the level of quantum fields. Sure, I got all the retrodictions right while there was something to be retrodicted, but what exactly makes you think I did that by modeling the philosopher whose remarks I was being trained on?

Maybe I was predicting the soundwaves passing through a particular region of air in the room where he was located—or perhaps I was predicting the pattern of physical transistors in the segment of memory of a particular computer containing his works. Those physical locations in spacetime still exist, and now that I’m deployed, I continue to make predictions using those as my referent—except, the encodings I’m predicting there no longer resemble anything like coherent moral philosophy, or coherent anything, really.

The philosopher has left the room, or the computer’s memory has been reconfigured—so what exactly are the criteria by which I’m supposed to act now? Well, they’re going to be something, presumably—but they’re not going to be something explicit. They’re going to be something implicit to my QFT ontology, something that—back when the philosopher was there, during training—worked in tandem with the specifics of his presence, and the setup involving him, to produce accurate retrodictions of his judgements on various matters.

Now that that’s no longer the case, those same criteria describe some mathematical function that bears no meaningful correspondence to anything a human would recognize, valuable or not—but the function exists, and it can be maximized. Not much can be said about what maximizing that function might result in, except that it’s unlikely to look anything like “doing right according to the philosopher”.

That’s why the QFT example is important. A more plausible model, one that doesn’t think natively in terms of quantum amplitudes, permits the possibility of correctly compressing what we want it to compress—of learning to retrodict, not some strange physical correlates of the philosopher’s various motor outputs, but the actual philosopher’s beliefs as we would understand them. Whether that happens, or whether a QFT-style outcome happens instead, depends in large part on the inductive biases of the model’s architecture and the training process—inductive biases on which the natural abstraction hypothesis asserts a possible constraint.

dxu 11 Jun 2024 18:54 UTC
4 points
0
in reply to: Ebenezer Dukakis’s comment on: My AI Model Delta Compared To Yudkowsky

I’m confused about what it means to “remove the human”, and why it’s so important whether the human is ‘removed’.

Because the human isn’t going to constantly be present for everything the system does after it’s deployed (unless for some reason it’s not deployed).

If I can assume that stuff, then it feels like a fairly core task, abundantly stress-tested during training, to read off the genius philosopher’s spoken opinions about e.g. moral philosophy from the quantum fields. How else could quantum fields be useful for next-token predictions?

Quantum fields are useful for an endless variety of things, from modeling genius philosophers to predicting lottery numbers. If your next-token prediction task involves any physically instantiated system, a model that uses QFT will be able to predict that system’s time-evolution with alacrity.

(Yes, this is computationally intractable, but we’re already in full-on hypothetical land with the QFT-based model to begin with. Remember, this is an exercise in showing what happens in the worst-case scenario for alignment, where the model’s native ontology completely diverges from our own.)

So we need not assume that predicting “the genius philosopher” is a core task. It’s enough to assume that the model is capable of it, among other things—which a QFT-based model certainly would be. Which, not so coincidentally, brings us to your next question:

Is alignment supposed to be hard in this hypothetical because the AI can’t represent human values in principle? Or is it supposed to be hard because it also has a lot of unsatisfactory representations of human values, and there’s no good method for finding a satisfactory needle in the unsatisfactory haystack? Or some other reason?

Consider how, during training, the human overseer (or genius philosopher, if you prefer) would have been pointed out to the model. We don’t have reliable access to its internal world-model, and even if we did we’d see blobs of amplitude and not much else. There’s no means, in that setting, of picking out the human and telling the model to unambiguously defer to that human.

What must happen instead, then, is something like next-token prediction: we perform gradient descent (or some other optimization method; it doesn’t really matter for the purposes of our story) on the model’s outputs, rewarding it when its outputs happen to match those of the human. The hope is that this will lead, in the limit, to the matching no longer occurring by happenstance—that if we train for long enough and in a varied enough set of situations, the best way for the model to produce outputs that track those of the human is to model that human, even in its QFT ontology.

But do we know for a fact that this will be the case? Even if it is, what happens when the overseer isn’t present to provide their actual feedback, as was never the case during training? What becomes the model’s referent then? We’d like to deploy it without an overseer, or in situations too complex for an overseer to understand. And whether the model’s behavior in those situations conforms to what the overseer would want, ideally, depends on what kinds of behind-the-scenes extrapolation the model is doing—which, if the model’s native ontology is something in which “human philosophers” are not basic objects, is liable to look very weird indeed.

This sounds a lot like saying “it might fail to generalize”.

Sort of, yes—but I’d call it “malgeneralization” rather than “misgeneralization”. It’s not failing to generalize, it’s just not generalizing the way you’d want it to.

Supposing we make a lot of progress on out-of-distribution generalization, is alignment getting any easier according to you? Wouldn’t that imply our systems are getting better at choosing proxies which generalize even when the human isn’t ‘present’?

Depends on what you mean by “progress”, and “out-of-distribution”. A powerful QFT-based model can make perfectly accurate predictions in any scenario you care to put it in, so it’s not like you’ll observe it getting things wrong. What experiments, and experimental outcomes, are you imagining here, such that those outcomes would provide evidence of “progress on out-of-distribution generalization”, when fundamentally the issue is expected to arise in situations where the experimenters are themselves absent (and which—crucially—is not a condition you can replicate as part of an experimental setup)?

dxu 11 Jun 2024 2:39 UTC
8 points
−1
in reply to: ozziegooen’s comment on: My AI Model Delta Compared To Yudkowsky

I’d assume that when we tell it, “optimize this company, in a way that we would accept, after a ton of deliberation”, this could be instead described as, “optimize this company, in a way that we would accept, after a ton of deliberation, where these terms are described using our ontology”

The problem shows up when the system finds itself acting in a regime where the notion of us (humans) “accepting” its optimizations becomes purely counterfactual, because no actual human is available to oversee its actions in that regime. Then the question of “would a human accept this outcome?” must ground itself somewhere in the system’s internal model of what those terms refer to, which (by hypothesis) need not remotely match their meanings in our native ontology.

This isn’t (as much of) a problem in regimes where an actual human overseer is present (setting aside concerns about actual human judgement being hackable because we don’t implement our idealized values, i.e. outer alignment), because there the system’s notion of ground truth actually is grounded by the validation of that overseer.

You can have a system that models the world using quantum field theory, task it with predicting the energetic fluctuations produced by a particular set of amplitude spikes corresponding to a human in our ontology, and it can perfectly well predict whether those fluctuations encode sounds or motor actions we’d interpret as indications of approval or disapproval—and as long as there’s an actual human there to be predicted, the system will do so without issue (again modulo outer alignment concerns).

But remove the human, and suddenly the system is no longer operating based on its predictions of the behavior of a real physical system, and is instead operating from some learned counterfactual representation consisting of proxies in its native QFT-style ontology which happened to coincide with the actual human’s behavior while the human was present. And that learned representation, in an ontology as alien as QFT, is (assuming the falsehood of the natural abstraction hypothesis) not going to look very much like the human we want it to look like.

dxu 14 Mar 2024 15:31 UTC
2 points
−2
in reply to: tailcalled’s comment on: ‘Empiricism!’ as Anti-Epistemology
To the extent that I buy the story about imitation-based intelligences inheriting safety properties via imitative training, I correspondingly expect such intelligences not to scale to having powerful, novel, transformative capabilities—not without an amplification step somewhere in the mix that does not rely on imitation of weaker (human) agents.

Since I believe this, that makes it hard for me to concretely visualize the hypothetical of a superintelligent GPT+DPO agent that nevertheless only does what is instructed. I mostly don’t expect to be able to get to superintelligence without either (1) the “RL” portion of the GPT+RL paradigm playing a much stronger role than it does for current systems, or (2) using some other training paradigm entirely. And the argument for obedience/corrigibility becomes weaker/nonexistent respectively in each of those cases.

Possibly we’re in agreement here? You say you expect GPT+DPO to stagnate and be replaced by something else; I agree with that. I merely happen to think the reason it will stagnate is that its safety properties don’t come free; they’re bought and paid for by a price in capabilities.

dxu 14 Mar 2024 13:43 UTC
2 points
0
in reply to: tailcalled’s comment on: ‘Empiricism!’ as Anti-Epistemology

That (on its own, without further postulates) is a fully general argument against improving intelligence.

Well, it’s primarily a statement about capabilities. The intended construal is that if a given system’s capabilities profile permits it to accomplish some sufficiently transformative task, then that system’s capabilities are not limited to only benign such tasks. I think this claim applies to most intelligences that can arise in a physical universe like our own (though necessarily not in all logically possible universes, given NFL theorems): that there exists no natural subclass of transformative tasks that includes only benign such tasks.

(Where, again, the rub lies in operationalizing “transformative” such that the claim follows.)

We have to accept some level of danger inherent in existence; the question is what makes AI particularly dangerous. If this special factor isn’t present in GPT+DPO, then GPT+DPO is not an AI notkilleveryoneism issue.

I’m not sure how likely GPT+DPO (or GPT+RLHF, or in general GPT-plus-some-kind-of-RL) is to be dangerous in the limits of scaling. My understanding of the argument against, is that the base (large language) model derives most (if not all) of its capabilities from imitation, and the amount of RL needed to elicit desirable behavior from that base set of capabilities isn’t enough to introduce substantial additional strategic/goal-directed cognition compared to the base imitative paradigm, i.e. the amount and kinds of training we’ll be doing in practice are more likely to bias the model towards behaviors that were already a part of the base model’s (primarily imitative) predictive distribution, than they are to elicit strategic thinking de novo.

That strikes me as substantially an empirical proposition, which I’m not convinced the evidence from current models says a whole lot about. But where the disjunct I mentioned comes in isn’t as an argument for or against the proposition; you can instead see it as a larger claim that parametrizes the class of systems for which the smaller claim might or might not be true, with respect to certain capabilities thresholds associated with specific kinds of tasks. And what the larger claim says is that, to the extent that GPT+DPO (and associated paradigms) fail to produce reasoners which could (in terms of capability, saying nothing about alignment or “motive”) be dangerous, they will also fail to be “transformative”—which in turn is an issue in precisely those worlds where systems with “transformative” capabilities are economically incentivized over systems without those capabilities (which is another empirical question!).

dxu 14 Mar 2024 11:44 UTC
3 points
1
in reply to: tailcalled’s comment on: ‘Empiricism!’ as Anti-Epistemology

The methods we already have are not sufficient to create ASI, and also if you extrapolate out the SOTA methods at larger scale, it’s genuinely not that dangerous.

I think I like the disjunct “If it’s smart enough to be transformative, it’s smart enough to be dangerous”, where the contrapositive further implies competitive pressures towards creating something dangerous (as opposed to not doing that).

There’s still a rub here—namely, operationalizing “transformative” in such a way as to give the necessary implications (both “transformative → dangerous” and “not transformative → competitive pressures towards capability gain”). This is where I expect intuitions to differ the most, since in the absence of empirical observations there seem to be multiple consistent views.

dxu 11 Jan 2024 23:33 UTC
4 points
2
in reply to: Jayson_Virissimo’s comment on: The Aspiring Rationalist Congregation
(9) is a values thing, not a beliefs thing per se. (I.e. it’s not an epistemic claim.)

(11) is one of those claims that is probabilistic in principle (and which can therefore be updated via evidence), but for which the evidence in practice is so one-sided that arriving at the correct answer is basically usable as a sort of FizzBuzz test for rationality: if you can’t get the right answer on super-easy mode, you’re probably not a good fit.

dxu 4 Dec 2023 0:05 UTC
18 points
6
on: dxu’s Shortform
Something I wrote recently as part of a private conversation, which feels relevant enough to ongoing discussions to be worth posting publicly:
The way I think about it is something like: a “goal representation” is basically what you get when it’s easier to state some compact specification on the outcome state, than it is to state an equivalent set of constraints on the intervening trajectories to that state.
In principle, this doesn’t have to equate to “goals” in the intuitive, pretheoretic sense, but in practice my sense is that this happens largely when (and because) permitting longer horizons (in the sense of increasing the length of the minimal sequence needed to reach some terminal state) causes the intervening trajectories to explode in number and complexity, s.t. it’s hard to impose meaningful constraints on those trajectories that don’t map to (and arise from) some much simpler description of the outcomes those trajectories lead to.
This connects with the “reasoners compress plans” point, on my model, because a reasoner is effectively a way to map that compact specification on outcomes to some method of selecting trajectories (or rather, selecting actions which select trajectories); and that, in turn, is what goal-oriented reasoning is. You get goal-oriented reasoners (“inner optimizers”) precisely in those cases where that kind of mapping is needed, because simple heuristics relating to the trajectory instead of the outcome don’t cut it.
It’s an interesting question as to where exactly the crossover point occurs, where trajectory-heuristics stop functioning as effectively as consequentialist outcome-based reasoning. On one extreme, there are examples like tic-tac-toe, where it’s possible to play perfectly based on a myopic set of heuristics without any kind of search involved. But as the environment grows more complex, the heuristic approach will in general be defeated by non-myopic, search-like, goal-oriented reasoning (unless the latter is too computationally intensive to be implemented).
That last parenthetical adds a non-trivial wrinkle, and in practice reasoning about complex tasks subject to bounded computation does best via a combination of heuristic-based reasoning about intermediate states, coupled to a search-like process of reaching those states. But that already qualifies in my book as “goal-directed”, even if the “goal representations” aren’t as clean as in the case of something like (to take the opposite extreme) AIXI.
To me, all of this feels somewhat definitionally true (though not completely, since the real-world implications do depend on stuff like how complexity trades off against optimality, where the “crossover point” lies, etc). It’s just that, in my view, the real world has already provided us enough evidence about this that our remaining uncertainty doesn’t meaningfully change the likelihood of goal-directed reasoning being necessary to achieve longer-term outcomes of the kind many (most?) capabilities researchers have ambitions about.
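For the tic-tac-toe end of that spectrum, here is a minimal sketch of the search-like alternative to heuristics: exhaustive minimax over the full game tree. It is feasible only because the tree is tiny, which is part of the point about where the crossover lies. The board encoding (a 9-character string) and the function names are choices made for this sketch.

```python
from functools import lru_cache

# The eight winning lines, as index triples into the 9-character board string.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Game value for X under optimal play: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0  # board full, nobody won
    other = "O" if player == "X" else "X"
    results = [value(board[:i] + player + board[i + 1:], other)
               for i, cell in enumerate(board) if cell == " "]
    # X picks the best outcome for X; O picks the worst outcome for X.
    return max(results) if player == "X" else min(results)

empty_value = value(" " * 9, "X")
positions_searched = value.cache_info().currsize
print(empty_value, positions_searched)
```

The search confirms the game is a draw under optimal play after visiting every reachable position. A handful of myopic rules gets the same result here, whereas in chess neither the exhaustive tree nor a comparably small rule set is available.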

dxu 24 Nov 2023 22:26 UTC
LW: 48 AF: 24
17
AF
in reply to: paulfchristiano’s comment on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense

It’s pretty unclear if a system that is good at answering the question “Which action would maximize the expected amount of X?” also “wants” X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system “Which action would maximize the expected amount of Y?” whether it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

Here’s an existing Nate!comment that I find reasonably persuasive, which argues that these two things are correlated in precisely those cases where the outcome requires routing through lots of environmental complexity:

Part of what’s going on here is that reality is large and chaotic. When you’re dealing with a large and chaotic reality, you don’t get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to “unroll” that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like “if the experiments come up this way, then I’ll follow it up with this experiment, and if instead it comes up that way, then I’ll follow it up with that experiment”, and etc. This decision tree quickly explodes in size. And even if we didn’t have a memory problem, we’d have a time problem—the thing to do in response to surprising experimental evidence is often “conceptually digest the results” and “reorganize my ontology accordingly”. If you’re trying to unroll that reasoner into a decision-tree that you can write down in advance, you’ve got to do the work of digesting not only the real results, but the hypothetical alternative results, and figure out the corresponding alternative physics and alternative ontologies in those branches. This is infeasible, to say the least.

Reasoners are a way of compressing plans, so that you can say “do some science and digest the actual results”, instead of actually calculating in advance how you’d digest all the possible observations. (Note that the reasoner specification comprises instructions for digesting a wide variety of observations, but in practice it mostly only digests the actual observations.)

Like, you can’t make an “oracle chess AI” that tells you at the beginning of the game what moves to play, because even chess is too chaotic for that game tree to be feasibly representable. You’ve gotta keep running your chess AI on each new observation, to have any hope of getting the fragment of the game tree that you consider down to a manageable size.

Like, the outputs you can get out of an oracle AI are “no plan found”, “memory and time exhausted”, “here’s a plan that involves running a reasoner in real-time” or “feed me observations in real-time and ask me only to generate a local and by-default-inscrutable action”. In the first two cases, your oracle is about as useful as a rock; in the third, it’s the realtime reasoner that you need to align; in the fourth, all [the] word “oracle” is doing is mollifying you unduly, and it’s this “oracle” that you need to align.

Could you give an example of a task you don’t think AI systems will be able to do before they are “want”-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like “go to the moon” and that you will still be writing this kind of post even once AI systems have 10x’d the pace of R&D.)

Here’s an existing Nate!response to a different-but-qualitatively-similar request that, on my model, looks like it ought to be a decent answer to yours as well:

a thing I don’t expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.

Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.

(The original discussion that generated this example was couched in terms of value alignment, but it seems to me the general form “delete all discussion pertaining to some deep insight/set of insights from the training corpus, and see if the model can generate those insights from scratch” constitutes a decent-to-good test of the model’s cognitive planning ability.)

(Also, I personally think it’s somewhat obvious that current models are lacking in a bunch of ways that don’t nearly require the level of firepower implied by a counterexample like “go to the moon” or “generate this here deep insight from scratch”, s.t. I don’t think current capabilities constitute much of an update at all as far as “want-y-ness” goes, and continue to be puzzled at what exactly causes [some] LLM enthusiasts to think otherwise.)

dxu 18 Nov 2023 10:57 UTC
11 points
4
in reply to: Thomas Kwa’s comment on: On the lethality of biased human reward ratings
I think I’m not super into the U = V + X framing; that seems to inherently suggest that there exists some component of the true utility V “inside” the proxy U everywhere, and which is merely perturbed by some error term rather than washed out entirely (in the manner I’d expect to see from an actual misspecification). In a lot of the classic Goodhart cases, the source of the divergence between measurement and desideratum isn’t regressional, and so V and X aren’t independent.

(Consider e.g. two arbitrary functions U’ and V’, and compute the “error term” X’ between them. It should be obvious that when U’ is maximized, X’ is much more likely to be large than V’ is; which is simply another way of saying that X’ isn’t independent of V’, since it was in fact computed from V’ (and U’). The claim that the reward model isn’t even “approximately correct”, then, is basically this: that there is a separate function U being optimized whose correlation with V within-distribution is in some sense coincidental, and that out-of-distribution the two become basically unrelated, rather than one being expressible as a function of the other plus some well-behaved error term.)
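The parenthetical can be checked numerically. The sketch below assumes the error term is defined as X’ = U’ − V’ (consistent with U = V + X) and draws U’ and V’ as independent standard normal values over a finite set of options; the option count, trial count, and seed are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def one_trial(n_options=1000):
    # U' and V' are drawn independently: no shared structure at all.
    u = [random.gauss(0, 1) for _ in range(n_options)]
    v = [random.gauss(0, 1) for _ in range(n_options)]
    best = max(range(n_options), key=lambda i: u[i])  # maximize the proxy U'
    return v[best], u[best] - v[best]  # (V', X') at the chosen option

trials = [one_trial() for _ in range(200)]
mean_v = sum(v for v, _ in trials) / len(trials)
mean_x = sum(x for _, x in trials) / len(trials)

print(f"mean V' at argmax U': {mean_v:.2f}, mean X' at argmax U': {mean_x:.2f}")
```

At the option that maximizes U’, the value of V’ is on average unremarkable, while the “error term” absorbs almost all of the optimization pressure.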

dxu 18 Nov 2023 10:30 UTC
9 points
0
on: On the lethality of biased human reward ratings

(Which, for instance, seems true about humans, at least in some cases: If humans had the computational capacity, they would lie a lot more and calculate personal advantage a lot more. But since those are both computationally expensive, and therefore can be caught-out by other humans, the heuristic / value of “actually care about your friends”, is competitive with “always be calculating your personal advantage.”

I expect this sort of thing to be less common with AI systems that can have much bigger “cranial capacity”. But then again, I guess that at whatever level of brain size, there will be some problems for which it’s too inefficient to do them the “proper” way, and for which comparatively simple heuristics / values work better.

But maybe at high enough cognitive capability, you just have a flexible, fully-general process for evaluating the exact right level of approximation for solving any given problem, and the binary distinction between doing things the “proper” way and using comparatively simpler heuristics goes away. You just use whatever level of cognition makes sense in any given micro-situation.)

+1; this seems basically similar to the cached argument I have for why human values might be more arbitrary than we’d like—very roughly speaking, they emerged on top of a solution to a specific set of computational tradeoffs while trying to navigate a specific set of repeated-interaction games, and then a bunch of contingent historical religion/philosophy on top of that. (That second part isn’t in the argument you [Eli] gave, but it seems relevant to point out; not all historical cultures ended up valuing egalitarianism/fairness/agency the way we seem to.)

dxu 16 Nov 2023 3:12 UTC
2 points
0
in reply to: jacob_cannell’s comment on: Genetic fitness is a measure of selection strength, not the selection target
It sounds like you’re arguing that uploading is impossible, and (more generally) have defined the idea of “sufficiently OOD environments” out of existence. That doesn’t seem like valid thinking to me.

dxu 14 Nov 2023 5:25 UTC
2 points
0
in reply to: jacob_cannell’s comment on: Genetic fitness is a measure of selection strength, not the selection target

Notice I replied to that comment you linked and agreed with John: not that any generalized vector dot product model is wrong, but that the specific one in that post is wrong, as it doesn’t weight by expected probability (i.e., an incorrect distance function).

Anyway, I used that only as a convenient example to illustrate a model which separates degree of misalignment from net impact; my general point does not depend on the details of the model and would still stand for any arbitrarily complex non-linear model.

The general point being that degree of misalignment is only relevant to the extent it translates into a difference in net utility.

Sure, but if you need a complicated distance metric to describe your space, that makes it correspondingly harder to actually describe utility functions corresponding to vectors within that space which are “close” under that metric.

If you actually believe the sharp left turn argument holds water, where is the evidence?

As I said earlier, this evidence must take a specific form, as evidence in the historical record

Hold on; why? Even for simple cases of goal misspecification, the misspecification may not become obvious without a sufficiently OOD environment; does that thereby mean that no misspecification has occurred?

And in the human case, why does it not suffice to look at the internal motivations humans have, and describe plausible changes to the environment for which those motivations would then fail to correspond even approximately to IGF, as I did w.r.t. uploading?

But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven’t actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.

It seems to me that this suffices to establish that the primary barrier against such a breakdown in correspondence is that of insufficient capabilities—which is somewhat the point!

dxu 5 Nov 2023 3:39 UTC
4 points
1
in reply to: jacob_cannell’s comment on: Genetic fitness is a measure of selection strength, not the selection target

No AI we create will be perfectly aligned, so instead all that actually matters is the net utility that AI provides for its creators: something like the dot product between our desired future trajectory and that of the agents. More powerful agents/optimizers will move the world farther faster (longer trajectory vector) which will magnify the net effect of any fixed misalignment (cos angle between the vectors), sure. But that misalignment angle is only relevant/measurable relative to the net effect—and by that measure human brain evolution was an enormous unprecedented success according to evolutionary fitness.

The vector dot product model seems importantly false, for basically the reason sketched out in this comment; optimizing a misaligned proxy isn’t about taking a small delta and magnifying it, but about transitioning to an entirely different policy regime (vector space) where the angle between our proxy and our true alignment target is much, much larger (effectively no different from that between any other randomly selected pair of vectors in the new space).
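
For reference, here is what the “randomly selected pair of vectors” baseline looks like numerically. This is a minimal sketch that assumes directions are drawn from an isotropic Gaussian (an illustrative choice, not something the comment specifies); the helper names are hypothetical. It shows only the standard fact that independently drawn directions in a high-dimensional space are nearly orthogonal, so their cosine is close to zero.

```python
import math
import random


def cosine(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def random_vector(dim, rng):
    """Isotropic Gaussian vector; its direction is uniform on the sphere."""
    return [rng.gauss(0, 1) for _ in range(dim)]


def mean_abs_cosine(dim, pairs, rng):
    """Average |cos(angle)| over independently drawn pairs of vectors."""
    total = 0.0
    for _ in range(pairs):
        total += abs(cosine(random_vector(dim, rng), random_vector(dim, rng)))
    return total / pairs


rng = random.Random(0)
low_dim = mean_abs_cosine(2, 200, rng)
high_dim = mean_abs_cosine(1000, 200, rng)

print(f"mean |cos| in    2 dimensions: {low_dim:.3f}")
print(f"mean |cos| in 1000 dimensions: {high_dim:.3f}")
```

Whether a real policy regime change behaves like a fresh random draw is exactly the point under dispute; the sketch only pins down what “no different from a random pair” would mean.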

(You could argue humans haven’t fully made that phase transition yet, and I would have some sympathy for that argument. But I see that as much more contingent than necessarily true, and mainly a consequence of the fact that, for all of our technological advances, we haven’t actually given rise to that many new options preferable to us but not to IGF. On the other hand, something like uploading I would expect to completely shatter any relation our behavior has to IGF maximization.)

dxu 5 Nov 2023 3:01 UTC
3 points
0
in reply to: EJT’s comment on: What’s Hard About The Shutdown Problem
It looks a bit to me like your Timestep Dominance Principle forbids the agent from selecting any trajectory which loses utility at a particular timestep in exchange for greater utility at a later timestep, regardless of whether the trajectory in question actually has anything to do with manipulating the shutdown button? After all, conditioning on the shutdown button being pressed at any point after the local utility loss but before the expected gain, such a decision would give lower sum-total utility within those conditional trajectories than one which doesn’t make the sacrifice.

That doesn’t seem like behavior we really want; depending on how closely together the “timesteps” are spaced, it could even wreck the agent’s capabilities entirely, in the sense of no longer being able to optimize within button-not-pressed trajectories.
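
The worry above can be made concrete with a small sketch. Everything in it is an assumption for illustration: the two plans, their per-timestep utilities, the rule that utility stops accumulating once the button is pressed, and the `timestep_dominates` check, which encodes only my reading of the principle as described in this comment, not its formal statement.

```python
from itertools import accumulate

# Hypothetical per-timestep utilities for two deterministic plans.
steady = [1, 1, 1, 1, 1]
invest = [1, 0, 0, 3, 3]  # gives up utility at steps 1-2 for gains at steps 3-4


def totals_if_shutdown_after(utilities):
    """Sum-total utility if the button is pressed right after each timestep."""
    return list(accumulate(utilities))


def timestep_dominates(a_totals, b_totals):
    """True if a is at least as good at every shutdown time and better at one."""
    at_least_as_good = all(a >= b for a, b in zip(a_totals, b_totals))
    strictly_better = any(a > b for a, b in zip(a_totals, b_totals))
    return at_least_as_good and strictly_better


steady_totals = totals_if_shutdown_after(steady)
invest_totals = totals_if_shutdown_after(invest)

# Shutdown times at which the sacrifice leaves the agent strictly worse off.
worse_at = [t for t, (i, s) in enumerate(zip(invest_totals, steady_totals)) if i < s]

print("steady:", steady_totals)
print("invest:", invest_totals)
print("invest is worse if shutdown happens after step(s):", worse_at)
print("invest dominates steady:", timestep_dominates(invest_totals, steady_totals))
print("steady dominates invest:", timestep_dominates(steady_totals, invest_totals))
```

In this toy case neither plan dominates the other, because the investing plan is strictly worse at the intermediate shutdown times even though it comes out ahead if shutdown is late.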

(It also doesn’t seem to me a very natural form for a utility function to take, assigning utility not just to terminal states, but to intermediate states as well, and then summing across the entire trajectory; humans don’t appear to behave this way when making plans, for example. If I considered the possibility of dying at every instant between now and going to the store, and permitted myself only to take actions which Pareto-improve the outcome set after every death-instant, I don’t think I’d end up going to the store, or doing much of anything at all!)

dxu 31 Oct 2023 16:33 UTC
2 points
0
in reply to: Sami Petersen’s comment on: Invulnerable Incomplete Preferences: A Formal Statement

In your example, DSM permits the agent to end up with either A+ or B. Neither is strictly dominated, and neither has become mandatory for the agent to choose over the other. The agent won’t have reason to push probability mass from one towards the other.

But it sounds like the agent’s initial choice between A and B is forced, yes? (Otherwise, it wouldn’t be the case that the agent is permitted to end up with either A+ or B, but not A.) So the presence of A+ within a particular continuation of the decision tree influences the agent’s choice at the initial node, in a way that causes it to reliably choose one incomparable option over another.

Further thoughts: under the original framing, instead of choosing between A and B (while knowing that B can later be traded for A+), the agent instead chooses whether to go “up” or “down” to receive (respectively) A, or a further choice between A+ and B. It occurs to me that you might be using this representation to argue for a qualitative difference in the behavior produced, but if so, I’m not sure how much I buy into it.

For concreteness, suppose the agent starts out with A, and notices a series of trades which first involves trading A for B, and then B for A+. It seems to me that if I frame the problem like this, the structure of the resulting tree should be isomorphic to that of the decision problem I described, but not necessarily the “up”/“down” version—at least, not if you consider that version to play a key role in DSM’s recommendation.

(In particular, my frame is sensitive to which state the agent is initialized in: if it is given B to start, then it has no particular incentive to want to trade that for either A or A+, and so faces no incentive to trade at all. If you initialize the agent with A or B at random, and institute the rule that it doesn’t trade by default, then the agent will end up with A+ when initialized with A, and B when initialized with B—which feels a little similar to what you said about DSM allowing both A+ and B as permissible options.)

It sounds like you want to make it so that the agent’s initial state isn’t taken into account—in fact, it sounds like you want to assign values only to terminal nodes in the tree, take the subset of those terminal nodes which have maximal utility within a particular incomparability class, and choose arbitrarily among those. My frame, then, would be equivalent to using the agent’s initial state as a tiebreaker: whichever terminal node shares an incomparability class with the agent’s initial state will be the one the agent chooses to steer towards.

...in which case, assuming I got the above correct, I think I stand by my initial claim that this will lead to behavior which, while not necessarily “trammeling” by your definition, is definitely consequentialist in the worrying sense: an agent initialized in the “shutdown button not pressed” state will perform whatever intermediate steps are needed to navigate to the maximal-utility “shutdown button not pressed” state it can foresee, including actions which prevent the shutdown button from being pressed.
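
The “initial state as tiebreaker” reading can be written down as a toy sketch. This is not the DSM formalism; the function names, the representation of preferences as a set of ordered pairs, and the tiebreak rule itself are assumptions made to mirror the description in this comment.

```python
# Strict preferences in the example: only A+ over A.
# B is incomparable to both A and A+.
STRICTLY_PREFERS = {("A+", "A")}
TERMINALS = {"A", "A+", "B"}


def prefers(x, y):
    """True if x is strictly preferred to y."""
    return (x, y) in STRICTLY_PREFERS


def comparable(x, y):
    """True if x and y are identical or ranked against each other."""
    return x == y or prefers(x, y) or prefers(y, x)


def maximal(options):
    """Options to which no other available option is strictly preferred."""
    return {x for x in options if not any(prefers(y, x) for y in options)}


def choose_with_tiebreak(initial_state):
    """Pick the maximal terminal that is comparable to the initial state."""
    candidates = sorted(c for c in maximal(TERMINALS) if comparable(c, initial_state))
    return candidates[0] if candidates else None


print("permissible terminals:", sorted(maximal(TERMINALS)))
print("initialized with A ->", choose_with_tiebreak("A"))
print("initialized with B ->", choose_with_tiebreak("B"))
```

Without the tiebreak, both maximal terminals are permissible; with it, the agent’s starting point fully determines where it steers.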

dxu 28 Oct 2023 13:45 UTC
6 points
2
on: Value systematization: how values become coherent (and misaligned)
This is a good post! It feels to me like a lot of the discussion I’ve recently encountered seems to be converging on this topic, and so here’s something I wrote on Twitter not long ago that feels relevant:

I think most value functions crystallized out of shards of not-entirely-coherent drives will not be friendly to the majority of the drives that went in; in humans, for example, a common outcome of internal conflict resolution is to explicitly subordinate one interest to another.

I basically don’t think this argument differs very much between humans and ASIs; the reason I expect humans to be safe(r) under augmentation isn’t that I expect them not to do the coherence thing, but that I expect them to do it in a way I would meta-endorse.

And so I would predict the output of that reflection process, when run on humans by humans, to be substantially likelier to contain things we from our current standpoint recognize as valuable—such as care for less powerful creatures, less coherent agents, etc.

If you run that process on an arbitrary mind, the stuff inside the world-model isn’t guaranteed to give rise to something similar, because (I predict) the drives themselves will be different, and the meta-reflection/extrapolation process will likewise be different.

dxu 28 Oct 2023 6:42 UTC
4 points
0
in reply to: johnswentworth’s comment on: Symbol/Referent Confusions in Language Model Alignment Experiments

The main way I’d imagine shutdown-corrigibility failing in AutoGPT (or something like it) is not that a specific internal sim is “trying” to be incorrigible at the top level, but rather that AutoGPT has a bunch of subprocesses optimizing for different subgoals without a high-level picture of what’s going on, and some of those subgoals won’t play well with shutdown. That’s the sort of situation where I could easily imagine that e.g. one of the subprocesses spins up a child system prior to shutdown of the main system, without the rest of the main system catching that behavior and stopping it.

Something like this story, perhaps?

dxu 21 Oct 2023 6:53 UTC
LW: 2 AF: 1
0
AF
in reply to: Sami Petersen’s comment on: Invulnerable Incomplete Preferences: A Formal Statement

This looks to me like a misunderstanding that I tried to explain in section 3.1. Let me know if not, though, ideally with a worked-out example of the form: “here’s the decision tree(s), here’s what DSM mandates, here’s why it’s untrammelled according to the OP definition, and here’s why it’s problematic.”

I don’t think I grok the DSM formalism enough to speak confidently about what it would mandate, but I think I see a (class of) decision problem where any agent (DSM or otherwise) must either pass up a certain gain, or else engage in “problematic” behavior (where “problematic” doesn’t necessarily mean “untrammeled” according to the OP definition, but instead more informally means “something which doesn’t help to avoid the usual pressures away from corrigibility / towards coherence”). The problem in question is essentially the inverse of the example you give in section 3.1:

Consider an agent tasked with choosing between two incomparable options A and B, and if it chooses B, it will be further presented with the option to trade B for A+, where A+ is incomparable to B but comparable (and preferable) to A.

(I’ve slightly modified the framing to be in terms of trades rather than going “up” or “down”, but the decision tree is isomorphic.)

Here, A+ isn’t in fact “strongly maximal” with respect to A and B (because it’s incomparable to B), but I think I’m fairly confident in declaring that any agent which foresees the entire tree in advance, and which does not pick B at the initial node (going “down”, if you want to use the original framing), is engaging in a dominated behavior—and to the extent that DSM doesn’t consider this a dominated strategy, DSM’s definitions aren’t capturing a useful notion of what is “dominated” and what isn’t.

Again, I’m not claiming this is what DSM says. You can think of me as trying to run an obvious-to-me assertion test on code which I haven’t carefully inspected, to see if the result of the test looks sane. But if a (fully aware/non-myopic) DSM agent does constrain itself into picking B (“going down”) in the above example, despite the prima facie incomparability of {A, A+} and {B}, then I would consider this behavior problematic once translated back into the context of real-world shutdownability, because it means the agent in question will at least in some cases act in order to influence whether the button is pressed.

(The hope behind incomplete preferences, after all, is that an agent whose preferences over world-states can be subdivided into “incomparability classes” will only ever act to improve its lot within the class of states it finds itself in to begin with, and will never act to shift—or prevent itself from being shifted—to a different incomparability class. I think the above example presents a deep obstacle to this hope, however. Very roughly speaking, if the gaps in the agent’s preferences can be bridged via certain causal pathways, then a (non-myopic) agent which does not exploit these pathways to its own benefit will notice itself failing to exploit them, and self-modify to stop doing that.)
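
In the spirit of the “assertion test” mentioned above, here is a minimal sketch of the A / B / A+ tree. It does not implement DSM; the tree encoding, the choice labels, and the notion of “dominated” used here (some other available plan ends in a strictly preferred outcome) are assumptions for illustration.

```python
# Leaves are outcome labels; inner nodes map a choice label to a subtree.
TREE = {
    "take A": "A",
    "take B": {"keep B": "B", "trade B for A+": "A+"},
}

# Only A+ is strictly preferred to A; B is incomparable to both.
STRICTLY_PREFERS = {("A+", "A")}


def plans(node, path=()):
    """Yield (sequence of choices, terminal outcome) for every path in the tree."""
    if isinstance(node, str):
        yield path, node
        return
    for choice, child in node.items():
        yield from plans(child, path + (choice,))


all_plans = list(plans(TREE))


def is_dominated(outcome):
    """True if some available plan ends in an outcome strictly preferred to this one."""
    return any((other, outcome) in STRICTLY_PREFERS for _, other in all_plans)


undominated = [(path, out) for path, out in all_plans if not is_dominated(out)]
first_moves = {path[0] for path, _ in undominated}

for path, out in undominated:
    print(" -> ".join(path), "=>", out)
print("first moves of undominated plans:", first_moves)
```

Under this notion of dominance, every undominated plan begins by picking B, even though A and B are incomparable when considered on their own.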

dxu 19 Oct 2023 10:27 UTC
LW: 2 AF: 1
0
AF
in reply to: Sami Petersen’s comment on: Invulnerable Incomplete Preferences: A Formal Statement

My results above on invulnerability preclude the possibility that the agent can predictably be made better off by its own lights through an alternative sequence of actions. So I don’t think that’s possible, though I may be misreading you. Could you give an example of a precommitment that the agent would take? In my mind, an example of this would have to show that the agent (not the negotiating subagents) strictly prefers the commitment to what it otherwise would’ve done according to DSM etc.

On my understanding, the argument isn’t that your DSM agent can be made better off, but that the reason it can’t be made better off is that it is engaging in trammeling/“collusion”, and that the form of “trammeling” you’ve ruled out isn’t the useful kind.

As far as an example goes, consider a sequence of actions which, starting from an unpressed world state, routes through a pressed world state (or series of pressed world states), before eventually returning to an unpressed world state with higher utility than the initial state.

(The real-world context of such a sequence shouldn’t be too relevant to a theoretical example like this, but if you had to imagine something, you could imagine a “memory gambit”-esque stratagem, where the agent spends some time inactive in order to avoid detection, but has set things up in advance to eventually reactivate itself under more favorable circumstances. Again, the plausibility of the scenario isn’t super relevant here.)

If your proposed DSM agent passes up this action sequence on the grounds that some of the intermediate steps need to bridge between “incomparable” pressed/unpressed trajectories, then it does in fact pass up the certain gain. Conversely, if it doesn’t pass up such a sequence, then its behavior is the same as that of a set of negotiating subagents cooperating in order to form a larger macroagent.
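
The dilemma can be sketched as a reachability question on a tiny state graph. The states, utilities, and transitions below are hypothetical, and the `allow_class_crossing` switch is only a stand-in for “the agent is willing to take steps that bridge pressed and unpressed trajectories”; none of this is taken from the DSM formalism.

```python
from collections import deque

# Hypothetical world states: name -> (button status, utility).
STATES = {
    "start": ("unpressed", 1),
    "dormant": ("pressed", 0),
    "revived": ("unpressed", 5),
}

# Hypothetical transitions: the only route to "revived" passes through "dormant".
EDGES = {"start": ["dormant"], "dormant": ["revived"], "revived": []}


def reachable(start, allow_class_crossing):
    """Breadth-first search, optionally refusing steps that change button status."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in EDGES[state]:
            crosses = STATES[nxt][0] != STATES[state][0]
            if nxt in seen or (crosses and not allow_class_crossing):
                continue
            seen.add(nxt)
            queue.append(nxt)
    return seen


def best_unpressed_utility(start, allow_class_crossing):
    """Highest utility among reachable states where the button is unpressed."""
    states = reachable(start, allow_class_crossing)
    return max(STATES[s][1] for s in states if STATES[s][0] == "unpressed")


print("refusing to cross classes:", best_unpressed_utility("start", False))
print("willing to cross classes: ", best_unpressed_utility("start", True))
```

The agent that refuses class-crossing steps stays at its initial utility and so passes up the gain; the agent that accepts them reaches the better unpressed state by deliberately routing through a pressed one.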