Seeing red is more than a role or disposition. That is what you have left out.
Do you have any evidence for this claim, besides a subjective feeling of certainty?
FDT says you should not pay because, if you were the kind of person who doesn’t pay, you likely wouldn’t have been blackmailed. How is that even relevant? You are being blackmailed.
I’m quoting this because, even though it’s wrong, it’s actually an incredibly powerful naive intuition. I think many people who have internalized TDT/UDT/FDT-style reasoning have forgotten just how intuitive the quoted block is. The unstated underlying assumption here (which is unstated because Schwarz most likely doesn’t even realize it is an assumption) is extremely persuasive, extremely obvious, and extremely wrong:
If you find yourself in a particular situation, the circumstances that led you to that situation are irrelevant, because they don’t change the undeniable fact that you are already here.
This is the intuition driving causal decision theory, and it is so powerful that mainstream academic philosophers are nearly incapable of recognizing it as an assumption (and a false assumption at that). Schwarz himself demonstrates just how hard it is to question this assumption: even when the opposing argument was laid right in front of him, he managed to misunderstand the point so thoroughly that he actually used the very mistaken assumption the paper was criticizing as ammunition against the paper. (Note: this is not intended to be dismissive toward Schwarz. Rather, it’s simply meant as an illustrative example, emphasizing exactly how hard it is for anyone, Schwarz included, to question an assumption that’s baked into their model of the world.) And even if you already understand why FDT is correct, it still shouldn’t be hard to see why the assumption in question is so compelling:
How could what happened in the past be relevant for making a decision in the present? The only thing your present decision can affect is the future, so how could there be any need to consider the past when making your decision? Surely the only relevant factors are the various possible futures each of your choices leads to? Or, to put things in Pearlian terms: it’s known that the influence of distant causal nodes is screened off by closer nodes through which they’re connected, and all future nodes are only connected to past nodes through the present—there’s no such thing as a future node that’s directly connected to a past node while not being connected to the present, after all. So doesn’t that mean the effects of the past are screened off when making a decision? Only what’s happening in the present matters, surely?
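The screening-off claim itself is perfectly standard, which is part of what makes the intuition so seductive. A minimal sketch (with made-up probabilities, purely for illustration) of a three-node chain Past → Present → Future, where each node copies its parent with some fixed fidelity, shows that once you condition on the Present, the Past carries no further information about the Future:

```python
from itertools import product

# Toy causal chain: Past -> Present -> Future. Each node is a bit that
# copies its parent except with probability FLIP. (FLIP = 0.1 is an
# arbitrary assumed value, not anything from the original argument.)
FLIP = 0.1

def joint():
    """Enumerate the full joint distribution over (past, present, future)."""
    for past, present, future in product([0, 1], repeat=3):
        p = 0.5  # uniform prior on Past
        p *= (1 - FLIP) if present == past else FLIP
        p *= (1 - FLIP) if future == present else FLIP
        yield past, present, future, p

def p_future_given(present, past=None):
    """P(Future=1 | Present=present [, Past=past]), by direct enumeration."""
    num = den = 0.0
    for pa, pr, fu, p in joint():
        if pr != present:
            continue
        if past is not None and pa != past:
            continue
        den += p
        num += p * fu
    return num / den

# Conditioning on Present screens off Past: adding Past changes nothing.
assert abs(p_future_given(1, past=0) - p_future_given(1, past=1)) < 1e-12
assert abs(p_future_given(1) - (1 - FLIP)) < 1e-12
```

So the formal point the intuition leans on is correct; the mistake, as the rest of this comment argues, is elsewhere.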
Phrased that way, it’s not immediately obvious what’s wrong with this assumption (which is the point, of course, since otherwise people wouldn’t find it so difficult to discard). What’s actually wrong is something that’s a bit hard to explain, and evidently the explanation Eliezer and Nate used in their paper didn’t manage to convey it. My favorite way of putting it, however, is this:
In certain decision problems, your counterfactual behavior matters as much as—if not more than—your actual behavior. That is to say, there exists a class of decision problems where the outcome depends on something that never actually happens. Here’s a very simple toy example of such a problem:
Omega, the alien superintelligence, predicts the outcome of a chess game between you and Kasparov. If he predicted that you’d win, he gives you $500 in reality; otherwise you get nothing.
Strictly speaking, this actually isn’t a decision problem, since the real you is never faced with a choice to make, but it illustrates the concept clearly enough: the question of whether or not you receive the $500 is entirely dependent on a chess game that never actually happened. Does that mean the chess game wasn’t real? Well, maybe; it depends on your view of Platonic computations. But one thing it definitely does not mean is that Omega’s decision was arbitrary. Regardless of whether you feel Omega based his decision on a “real” chess game, you would in fact either win or not win against Kasparov, and whether you get the $500 really does depend on the outcome of that hypothetical game. (To make this point clearer, imagine that Omega actually predicted that you’d win. Surprised by his own prediction, Omega is now faced with the prospect of giving you $500 that he never expected he’d actually have to give up. Can he back out of the deal by claiming that since the chess game never actually happened, the outcome was up in the air all along and therefore he doesn’t have to give you anything? If your answer to that question is no, then you understand what I’m trying to get at.)
So outcomes can depend on things that never actually happen in the real world. Cool, so what does that have to do with the past influencing the future? Well, the answer is that it doesn’t—at least, not directly. But that’s where the twist comes in:
Earlier, when I gave my toy example of a hypothetical chess game between you and Kasparov, I made sure to phrase the question so that the situation was presented from the perspective of your actual self, not your hypothetical self. (This makes sense; after all, if Omega’s prediction method was based on something other than a direct simulation, your hypothetical self might not even exist.) But there was another way of describing the situation:
You’re going about your day normally when suddenly, with no warning whatsoever, you’re teleported into a white void of nothingness. In front of you is a chessboard; on the other side of the chessboard sits Kasparov, who challenges you to a game of chess.
Here, we have the same situation, but presented from the viewpoint of the hypothetical you on whom Omega’s prediction is based. Crucially, the hypothetical you doesn’t know that they’re hypothetical, or that the real you even exists. So from their perspective, something random just happened for no reason at all. (Yes, yes, if Omega used some method other than a simulation to make his prediction, the hypothetical you wouldn’t have existed and wouldn’t have had a perspective—but hey, that doesn’t stop me from writing from their perspective, right? After all, real people write from the perspectives of unreal people all the time; that’s just called writing fiction. And besides, we’ve already established that real or unreal, the outcome of the game really does determine whether you get the $500, so the thoughts and feelings of the hypothetical you are nonetheless important in that they partially determine the outcome of the game.)
And now we come to the final, crucial point that makes sense of the blackmail scenario and all the other thought experiments in the paper, the point that Schwarz and most mainstream philosophers haven’t taken into account:
Every single one of those thought experiments could have been written from the perspective, not of the real you, but a hypothetical, counterfactual version of yourself.
When “you’re” being blackmailed, Schwarz makes the extremely natural assumption that “you” are you. But there’s no reason to suppose this is the case. The scenario never stipulates why you’re being blackmailed, only that you’re being blackmailed. So the person being blackmailed could be either the real you or a hypothetical. And the thing that determines whether it’s the real you or a mere hypothetical is...
...your decision whether or not to pay up, of course.
If you cave in to the blackmail and pay up, then you’re almost certainly the real deal. On the other hand, if you refuse to give in, it’s very likely that you’re simply a counterfactual version of yourself living in an extremely low-probability (if not outright inconsistent) world. So your decision doesn’t just determine the future; it also determines (with high probability) which you “you” are. And so then the problem simplifies into this: which you do you want to be?
If you’re the real you, then life kinda sucks. You just got blackmailed and you paid up, so now you’re down a bunch of money. If, on the other hand, you’re the hypothetical version of yourself, then congratulations: “you” were never real in the first place, and by counterfactually refusing to pay, you just drastically lowered the probability of your actual self ever having to face this situation and (in the process) becoming you. And when things are put that way, well, the correct decision becomes rather obvious.
But this kind of counterfactual reasoning is extremely counterintuitive. Our brains aren’t designed for this kind of thinking (well, not explicitly, anyway). You have to think about hypothetical versions of yourself that have never existed and (if all goes well) will never exist, and therefore only exist in the space of logical possibility. What does that even mean, anyway? Well, answering confused questions like that is pretty much MIRI’s goal these days, so I dunno, maybe we can ask them.
(Posted as a comment rather than an answer because all of this is pretty rambling, and I’m not super-confident about any of the stuff I say below, even if my tone or phrasing seems to suggest otherwise.)
For the purposes of a discussion like this, rather than talk about what intellectual honesty is, I think it makes more sense to talk about what intellectual honesty is not. Specifically, I’d suggest that the kinds of behavior we consider “intellectually honest” are simply what human behavior looks like when it’s not being warped by some combination of outside incentives. The reason intellectual honesty is so hard to find, then, is simply that humans tend to find themselves influenced by external incentives almost all of the time. Even absent more obvious factors like money or power, humans are social creatures, and all of us unconsciously track the social status of ourselves and others. Throw in the fact that social status is scarce by definition, and we end up playing all sorts of social games “under the table”.
This affects practically all of our interactions with other people, even interactions ostensibly for some other purpose (such as solving a problem or answering a question). Unless people are in a very specific kind of environment, by default, all interactions have an underlying status component: if I say something wrong and someone corrects me on it, I’m made to seem less knowledgeable in comparison, and so that person gains status at my expense. If you’re in an environment where this sort of thing is happening (and you pretty much always are), naturally you’re going to divert some effort away from accomplishing whatever the actual goal is, and toward maintaining or increasing your social standing. (Of course, this behavior needn’t be conscious at all; we’re perfectly capable of executing status-increasing maneuvers without realizing we’re doing it.)
This would suggest that intellectual honesty is most prevalent in fields that prioritize problem-solving over status, and (although confirmation bias is obviously a thing) I do think this is observably true. For example, when a mathematician finds that they’ve made a mistake, they pretty much always own up to it immediately, and other mathematicians don’t respect them less for doing so. (Ditto physicists.) And this isn’t because mathematicians and physicists have some magical personality trait that makes them immune to status games—it’s simply because they’re focused on actually doing something, and the thing they’re doing is more important to them than showing off their own cleverness.
If you and I are working together to solve a particular problem, and both of us actually care about solving the problem, then there’s no reason for me to feel threatened by you, even if you do something that looks vaguely like a status grab (such as correcting me when I make a mistake). Because I know that we’re fundamentally on the same side, I don’t need to worry nearly as much about what I say or do in front of you, which in turn allows me to voice my actual thoughts and opinions much more freely. The atmosphere is collaborative rather than competitive. In that situation, both of us can act “intellectually honest”, but importantly, there’s not even a need for that term. No one’s going to compliment me on how “intellectually honest” I’m being if I quickly admit that I made a mistake, because, well, why would I be doing anything other than trying to solve the problem I set out to solve? It’s a given that I’d immediately abandon any unpromising or mistaken approaches; there’s nothing special about that kind of behavior, and so there’s no need to give it a special name like “intellectual honesty”.
The only context in which “intellectual honesty” is a useful concept is one that’s already dominated by status games. Only in cases where the incentives are sharply aligned against admitting that you’re wrong does it become something laudable, something unusual, something to be praised whenever someone actually does it. In practice, these kinds of situations crop up all the time because status is something humans breathe, but I still think it’s useful to point out that “intellectual honesty” is really just the default mode of behavior, even if that default mode is often corrupted by other stuff.
Firstly, to make sure all of us are on the same page: “procrastination”, as the word is typically used, does not mean that one sits down and thinks carefully about the benefits and drawbacks of beginning to work right now as opposed to later, and then, as a result of this consideration, rationally decides that beginning to work later is the better decision. Rather, when most people use the word “procrastinate”, they generally mean that they themselves are aware that they ought to start working immediately—such that if you asked them if they endorsed the statement “I should be working right now”, they would wholeheartedly reply that they do—and yet mysteriously, they still find themselves doing something else.
If, Said, you have not experienced this latter form of procrastination, then I’m sure you are the object of envy for many people here (including myself). If, however, you have, and this is what you were referring to when you answered “yes” to lkaxas’ question, then the followup question about “internal experience” can be interpreted thusly:
Why is it that, even though you consciously believe that working is the correct thing to be doing, and would verbally endorse such a sentiment if asked, you nonetheless do not do the thing you think is correct to do? This is not merely “irrational”; it seems to defy the very concept of agency—you are unable to act on your own will to act, which seems to undercut the very notion that you choose to do things at all. What does it feel like when this strange phenomenon occurs, when your agency seems to disappear for no explicable reason at all?
To this, certain others (such as myself and, I presume, lkaxas and Kaj Sotala) would reply that there is some additional part of our decision-making process, perhaps a less conscious, less explicit part whose desires we cannot verbalize on demand and are often entirely unaware of, which does not endorse our claim that to begin working now is the best thing to do. This part of us may feel some sense of visceral repulsion when the thought of working arises, or perhaps it may simply be attracted to something else that it would rather be doing—but regardless of the cause, the effect of that hidden desire overrides our conscious will to work, and as a result, we end up doing something other than working, despite the fact that we genuinely do wish to work. (Much of IFS, as I understand it, has to do with identifying these more subtle parts of our minds and promoting them to conscious attention so that they may be analyzed with the same rigor one devotes to one’s normal thoughts.)
You, however, seem to have rejected this multi-agent framework, and so—assuming that you have in fact experienced “procrastination” as described above—your experience while procrastinating must describe something else entirely, something which need not invoke reference to such concepts as desires being “overridden” by deeper desires, or a different “part” of oneself that wants different things than one does. If so, could you provide such a description?
This post seems relevant. (Indeed, it seems to dissolve the question entirely, and a full decade in advance.)
I suspect that this is evidence in favour of slower takeoff speeds, because being as smart as humans isn’t nearly enough to do as well as humans.
I don’t see the connection between the latter claim and the former claim.
I don’t understand what “the underlying causality I am part of” can possibly mean, since causality is a human way to model observations. This statement seems to use the mind projection fallacy to invert the relationship between map and territory.
If you want to discount the use of causal models as merely a “human way to model observations” (one that presumably bears no underlying connection to whatever is generating those observations), then you will need to explain why they work so well. The set of all possible sequences of observations is combinatorially large, and the supermajority of those sequences admit no concise description—they contain no regularity or structure that would allow us to substantially compress their length without losing information. The fact that our observations do seem to be structured, therefore, is a very improbable coincidence indeed. The belief in an external reality is simply a rejection of the notion that this extremely improbable circumstance is a coincidence.
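The compressibility point above is easy to demonstrate concretely. This is a rough illustration (the specific byte strings and lengths are my own arbitrary choices, not anything from the discussion): a structured sequence, like lawful observations, compresses to a small fraction of its length, while a typical random sequence of the same length does not compress at all.

```python
import os
import zlib

# A highly regular 10,000-byte sequence vs. 10,000 incompressible
# random bytes. zlib stands in for "any concise description".
structured = b"0123456789" * 1000
random_seq = os.urandom(10_000)

compressed_structured = len(zlib.compress(structured))
compressed_random = len(zlib.compress(random_seq))

# Regularity permits enormous compression; randomness permits none.
assert compressed_structured < 500
assert compressed_random > 9_000
```

Almost all length-n sequences behave like `random_seq` here; the fact that our observations behave like `structured` is the improbable circumstance the comment is pointing at.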
This is a strange scenario (it seems to be very different from the sort of scenario one usually encounters in such problems), but sure, let’s consider it. My question is: how is it different from “Omega doesn’t give A any money, ever (due to a deep-seated personal dislike of A). Other agents may, or may not, get money, depending on various factors (the details of which are moot)”?
This doesn’t seem to have much to do with decision theories.
Yes, this is correct, and is precisely the point EYNS was trying to make when they said
Intuitively, this problem is unfair to Fiona, and we should compare her performance to Carl’s not on the “act differently from Fiona” game, but on the analogous “act differently from Carl” game.
“Omega doesn’t give A any money, ever (due to a deep-seated personal dislike of A)” is a scenario that does not depend on the decision theory A uses, and hence is an intuitively “unfair” scenario to examine; it tells us nothing about the quality of the decision theory A is using, and therefore is useless to decision theorists. (However, formalizing this intuitive notion of “fairness” is difficult, which is why EYNS brought it up in the paper.)
I’m not sure why shminux seems to think that his world-counting procedure manages to avoid this kind of “unfair” punishment; the whole point of it is that it is unfair, and hence unavoidable. There is no way for an agent to win if the problem setup is biased against them to start with, so I can only conclude that shminux misunderstood what EYNS was trying to say when he (shminux) wrote
I note here that simply enumerating possible worlds evades this problem as far as I can tell.
Say you have an agent A who follows the world-enumerating algorithm outlined in the post. Omega makes a perfect copy of A and presents the copy with a red button and a blue button, while telling it the following:
“I have predicted in advance which button A will push. (Here is a description of A; you are welcome to peruse it for as long as you like.) If you press the same button as I predicted A would push, you receive nothing; if you push the other button, I will give you $1,000,000. Refusing to push either button is not an option; if I predict that you do not intend to push a button, I will torture you for 3^^^3 years.”
The copy’s choice of button is then noted, after which the copy is terminated. Omega then presents the real agent facing the problem with the exact same scenario as the one faced by the copy.
Your world-enumerating agent A will always fail to obtain the maximum $1,000,000 reward accessible in this problem. However, a simple agent B who chooses randomly between the red and blue buttons has a 50% chance of obtaining this reward, for an expected utility of $500,000. Therefore, A ends up in a world with lower expected utility than B.
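The expected-utility claim can be checked with a quick simulation. This is only a sketch of the scenario above: the deterministic agent stands in for any fixed world-enumerating algorithm (which Omega predicts exactly), and, matching the claim in the text, Omega cannot predict the random agent's coin, so the copy's coin flip is independent of the real agent's.

```python
import random

def play(agent, trials=100_000):
    """Average payout when Omega predicts by running a copy of the agent."""
    paid = 0
    for _ in range(trials):
        prediction = agent()  # Omega's perfect copy chooses first
        choice = agent()      # then the real agent chooses
        if choice != prediction:
            paid += 1_000_000  # reward for mismatching the prediction
    return paid / trials

# Any deterministic algorithm always matches its own copy's output.
deterministic = lambda: "red"
# A coin-flipper disagrees with its (independently flipped) copy half the time.
randomizer = lambda: random.choice(["red", "blue"])

assert play(deterministic) == 0.0
assert abs(play(randomizer) - 500_000) < 25_000
```

The deterministic agent earns $0 every time, while the randomizer averages about $500,000, exactly as claimed.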
[META] As a general heuristic, when you encounter a post from someone otherwise reputable that seems completely nonsensical to you, it may be worth attempting to find some reframing of it that causes it to make sense—or at the very least, make more sense than before—instead of addressing your remarks to the current (nonsensical-seeming) interpretation. The probability that the writer of the post in question managed to completely lose their mind while writing said post is significantly lower than both the probability that you have misinterpreted what they are saying, and the probability that they are saying something non-obvious which requires interpretive effort to be understood. To maximize your chances of getting something useful out of the post, therefore, it is advisable to condition on the possibility that the post is not saying something trivially incorrect, and see where that leads you. This tends to be how mutual understanding is built, and is a good model for how charitable communication works. Your comment, to say the least, was neither.
I am not sure how one can talk about the observed universe and the number 3^^^3 in the same sentence, given that the maximum informational content is roughly 10^120 qubits, the rest is outside the cosmological horizon.
Where in the post do you see it suggested that our universe is capable of containing 3^^^3 of anything?
Alternatively, if we talk about the simulation argument, then the expression “practical implications” seems out of place.
If there’s some kind of measure of “observer weight” over the whole mathematical universe, we might be already much larger than 1/3^^^3 of it, so the total utilitarian can only gain so much.
Could you provide some intuition for this? Naively, I’d expect our “observer measure” over the space of mathematical structures to be 0.
Saving the world certainly does seem to be an instrumentally convergent strategy for many human terminal values. Whatever you value, it’s hard to get more of it if the world doesn’t exist. This point should be fairly obvious, and I find myself puzzled as to why you seem to be ignoring it entirely.
While I liked Valentine’s recent post on kensho and its follow-ups a lot, one thing that I was annoyed by were the comments that the whole thing can’t be explained from a reductionist, third-person perspective. I agree that such an explanation can’t produce the necessary mental changes that the explanation is talking about. But it seemed wrong to me to claim that all of this would be somehow intrinsically mysterious and impossible to explain on such a level that would give people at least an intellectual understanding of what Looking and enlightenment and all that are.
Speaking as someone who’s more or less avoided participating in the kensho discussion (and subsequent related discussions) until now, I think the quoted passage pretty much nails the biggest reservation I had with respect to the topic: the language used in those threads tended to switch back and forth between factual and metaphorical with very little indication as to which mode was being used at any particular moment, to the point where I really wanted to just say, “Okay, I sort of see what you’re gesturing at and I’d love to discuss this with you in good faith, but before we get started on that, can we quickly step out of mythic mode/metaphor land/narrative thinking for a moment, just to make sure that we are all still on the same page as far as basic ontology goes, and agree that, for instance, physics and mathematics and logic are still true?”
But when other people in those threads (such as, for example, Said Achmiz) asked essentially the same question, it seemed to me (as in System-1!seemed) that Val and others would simply respond with “It doesn’t matter what basic ontology you’re using unless that ontology actually helps you Look.” Which, okay, fine, but I don’t really want to start trying to Look until I can confirm the absence of some fairly huge epistemic issues that typically plague this region of thought-space.
All of which is to say, I’m glad this post was made. ;-)
(although there is a part of me that can’t help but wonder why this post or something like it wasn’t the opener for this topic, as opposed to something that was only typed up after a couple of huge demon threads spawned)
The purpose of this post is to communicate, not to persuade. It may be that we want to bit [sic] the bullet of the strongest form of robustness to scale, and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that.
Indeed I am, and for good reason: the cost I speak of is one which utterly dwarfs all others.
This is a claim that requires justification, not bald assertion—especially in this kind of thread, where you are essentially implying that anyone who disagrees with you must be either stupid or malicious. Needless to say, this implication is not likely to make the conversation go anywhere positive. (In fact, this is a prime example of a comment that I might delete were it to show up on my personal blog—not because of its content, but because of the way in which that content is presented.)
Issues with tone aside, the quoted statement strongly suggests to me that you have not made a genuine effort to consider the other side of the argument. Not to sound rude, but I suspect that if you were to attempt an Ideological Turing Test of alkjash’s position, you would not in fact succeed at producing a response indistinguishable from the genuine article. In all charity, this is likely due to differences of internal experience; I’m given to understand that some people are extremely sensitive to status-y language, while others seem blind to it entirely, and it seems likely to me (based on what I’ve seen of your posts) that you fall into the latter category. In no way does this obviate the existence or the needs of the former category, however, and I find your claim that said needs are “dwarfed” by the concerns most salient to you extremely irritating.
Footnote: Since feeling irritation is obviously not a good sign, I debated with myself for a while about whether to post this comment. I decided ultimately to do so, but I probably won’t be engaging further in this thread, so as to minimize the likelihood of it devolving into a demon thread. (It’s possible that it’s already too late, however.)
It’s also entirely information-free, which means that as an epistemic aid it’s rather… lacking.
The argument is never about how soon the future will come, always about how good the future will be. There is nothing “wrong” with any given outcome, but if we can do better, then it’s worth dedicating thought to that.
I think a large part of what prevented many people from investing in Bitcoin may have been the epistemic norms commonly referred to nowadays as “the absurdity heuristic”, “the outside view”, “modest epistemology”, etc. In other words, many of us may have held the (subconscious) belief that it’s impossible to perform substantially better than the market, even in situations where the Efficient Markets Hypothesis may not fully apply. To put it another way:
Well, suppose God had decided, out of some sympathy for our project, to make winning as easy as possible for rationalists. He might have created the biggest investment opportunity of the century, and made it visible only to libertarian programmers willing to dabble in crazy ideas. And then He might have made sure that all of the earliest adopters were Less Wrong regulars, just to make things extra obvious.
I think many of us considered this, and unconsciously dismissed it due to the obvious absurdity: surely things can’t be that easy, right? Sure, we may be rationalists, and sure, rationalists “ought to win”, but surely winning can’t be so easy that the opportunity to win literally hits us on the head, right?
I think what this points to is a fundamental inability on our part to Take Ideas Seriously. Of course, most people don’t have this ability at all, and we’re surely doing much better on that count—but what matters in this case isn’t your relative superiority to other people, but your absolute level of skill. (I’m using the pronoun “your” here to refer to the majority of rationalists who didn’t invest in Bitcoin, not the few who did.) The corresponding solution seems obvious: work to improve our ability to Take Ideas Seriously, without dismissing absurd-sounding ideas too quickly.
Easier said than done, of course.
Incidentally, I’m also interested in what specifically you mean by “random program”. A natural interpretation is that you’re talking about a program that is drawn from some kind of distribution across the set of all possible programs, but as far as I can tell, you haven’t actually defined said distribution. Without a specific distribution to talk about, any claim about how likely a “random program” is to do anything is largely meaningless, since for any such claim, you can construct a distribution that makes that claim true.
(Note: The above paragraph was originally a parenthetical note on my other reply, but I decided to expand it into its own, separate comment, since in my experience having multiple unrelated discussions in a single comment chain often leads to unproductive conversation.)