I suppose that’s true in a very strict sense, but I wouldn’t expect people considering AI risk to have the level of uncertainty necessary for their decision to be predominantly swayed by that kind of second-order influence.
For example, someone can get pretty far with “dang, maybe GPT4 isn’t amazing at super duper deep reasoning, but it is great at knowing lots of things and helping synthesize information in areas that have incredibly broad complexity… And biology is such an area, and I dunno, it seems like GPT5 or GPT6 will, if unmitigated, have the kind of strength that lowers the bar on biorisk enough to be a problem. Or more of a problem.”
That’s already quite a few bits of information available by a combination of direct observation and one-step inferences. It doesn’t constrain them to “and thus, I must work on the fundamentals of agency,” but it seems like a sufficient justification for even relatively conservative governments to act.
Another item for the todo list:
1. Compile neural networks from fountains of autogenerated programs.
2. Generate additional permutations by variously scrambling compiled neural networks.
3. Generate more “natural” neural representations by training networks to predict the mapping implied by the original code.
4. Train an interpreter to predict the original program from the neural network.
Naive implementation likely requires a fairly big CodeLlama-34b-Instruct-tier interpreter and can only operate on pretty limited programs, but it may produce something interesting. Trying to apply the resulting interpreter on circuits embedded in larger networks probably won’t work, but… worth trying just to see what it does?
Might also be something interesting to be learned in spanning the gap between ‘compiled’ networks and trained networks. How close do they come to being affine equivalents? If the relationship isn’t affine, what kind of transform is required (and how complicated is it)?
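A minimal sketch of what the setup for steps 3 and 4 could look like, under big simplifying assumptions: “programs” here are tiny compositions of scalar primitives, and the “natural” representation is just a small MLP distilled from each program’s input/output mapping. All names are illustrative, not an actual implementation:

```python
import random
import torch
import torch.nn as nn

def make_random_program():
    """Autogenerate a trivial 'program': a random composition of scalar primitives."""
    ops = [lambda x: x + 1.0, lambda x: x * 2.0, torch.sin, torch.abs]
    chosen = [random.choice(ops) for _ in range(3)]
    def program(x):
        for op in chosen:
            x = op(x)
        return x
    return program, chosen

def distill_to_network(program, steps=2000):
    """Step 3: produce a 'natural' neural representation by fitting the program's I/O map."""
    net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                        nn.Linear(32, 32), nn.ReLU(),
                        nn.Linear(32, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.rand(256, 1) * 4 - 2              # inputs sampled from [-2, 2]
        loss = ((net(x) - program(x)) ** 2).mean()  # match the program's outputs
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net

program, source_ops = make_random_program()
net = distill_to_network(program)
# Step 4 (not shown): train an interpreter mapping net's weights back to source_ops.
```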
Another item for the todo list:

Autoregressive transformer gradient flow shapes earlier token computation to serve future predictions, but that early computation cannot condition on future tokens. This should serve as a regularizing influence on the internal structure of token predictions: in order to be useful to the largest possible set of future predictions, the local computation would need to factor itself into maximally reusable modules.
The greater the local uncertainty about the future, the less the local computation can be specialized to serve future tokens. Could consider it something like: the internal representation is a probability-weighted blend of representations useful to possible futures. If the local computation is highly confident in a narrow space, it can specialize more.
Simplicity biases would incentivize sharing modules more strongly. Even if the local computation suspects a narrower future distribution, it would be penalized for implementing specialized machinery that is too rarely useful.
One implication: many forms of token-parallelized search get blocked, because they require too much foresight-driven specialization.
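A quick toy check of the gradient flow claim (standard causally masked attention assumed; this only demonstrates that a late position’s loss produces gradients at early positions even though the early forward pass can’t see later tokens):

```python
import torch
import torch.nn as nn

seq_len, d_model = 8, 16
x = torch.randn(1, seq_len, d_model, requires_grad=True)  # stand-in for per-token activations

attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, _ = attn(x, x, x, attn_mask=causal_mask)  # position t can only attend to positions <= t

loss = out[0, -1].pow(2).sum()  # pretend the training loss lives only at the final position
loss.backward()

# Position 0 never saw any later token in its forward pass, but the final position's
# loss still reaches back and shapes it:
print(x.grad[0, 0].abs().sum())  # nonzero
```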
One thing I’m attempting to figure out is whether in that future, where government/traditional academic funding becomes dominant, there remains significant neglectedness in important subproblems because of how those systems tend to operate. I could see an OpenPhil pivot covering some of this, but it’d sure be nice to nail down at least a few more things when choosing between going all in on risky high EV ETG versus direct safety work.
There are some historical examples that might be informative, but it’s difficult for me to judge.
(An earlier update from you about Lightspeed’s status is what solidified the thought of “it sure would be nice if this was a widely available thing,” so thanks for that too!)
I don’t really think about it in terms of discrete capabilities. For an output that scores poorly:
1. There is an internal process that is responsible for the output.
2. The gradient of the loss with respect to the output can be backpropagated through the entire network.
3. The responsible process will have larger gradients.
4. The responsible process gets smacked.
The responsible process encompasses everything, including all the metareasoning. The more a chunk of computation, any computation, contributed to that bad output, the more it gets smacked.
The gradient will simply move the implementation towards something that isn’t outputting poorly scoring things, and most things that score well aren’t doing some galaxy brained deception strategy.
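A toy illustration of that credit assignment, with two made-up parallel branches where one dominates the bad output:

```python
import torch
import torch.nn as nn

x = torch.randn(64, 10)
target = torch.zeros(64, 1)

branch_a = nn.Linear(10, 1)  # dominant contributor to the output
branch_b = nn.Linear(10, 1)  # barely involved

output = 10.0 * branch_a(x) + 0.01 * branch_b(x)
loss = ((output - target) ** 2).mean()
loss.backward()

print(branch_a.weight.grad.abs().mean())  # large: this computation was "responsible"
print(branch_b.weight.grad.abs().mean())  # small: it contributed little
```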
> doesn’t that mean that conditioned on winning roll-outs, the gradient hacking process is naturally self-reinforcing?
To the extent that the “gradient hacking process” is an optimizer-accessible path to lower loss, it can persist. A secretly-devious subagent could survive indefinitely, be reinforced, and take over more of input space if it always outputs things that are scored well during training.
If the RL training process starts with a situationally aware superintelligent adversary, I bet it could persist by never outputting anything bad during training and by outperforming other nascent implementations.
But suppose the training doesn’t start with a superintelligent adversary. Finding its way to some form of internal generalizing algorithm seems likely because doing so is just so often useful, but going beyond that to actually implementing a policy of merely hiding, let alone gradient hacking, is hard.
It would need to snap from a benign implementation to a highly capable deceptive process effectively instantly with respect to SGD. It couldn’t develop slowly; incomplete attempts at deception would very likely get the nascent deceiver killed early. And, while I expect capability on certain tasks to sometimes be rapid through snapping together the last pieces of a more strongly generalizing algorithm, capability alone is not sufficient.
At the moment, I suspect the space of possible algorithms that could be ‘snapped’ into is large and obviously biased against loss increasing algorithms. Getting a deceptive mesaoptimizer or gradient hacker out of this process requires rolling a lot of dice and getting an extremely specific subset of algorithms, and I don’t see a reason to think they are remotely probable compared to all the “not that” options.
Or to phrase it another way, the simplest way to reduce loss is to reduce loss, not to secretly plan to increase loss just out of sight.
This is an area where a single experiment could change my mind immediately, but so far I’ve only seen failed attempts (and all my own attempts to think of a way to do it have dead-ended).
That said, using an optimizer that doesn’t do the same sort of credit assignment combined with some form of training process with tons of open space for it to run free changes the game. It seems way easier in that context, even if I still don’t know how to demonstrate it.
If the gradient is able to flow back to the decisionmaker (in this case, the potentially misaligned agent that’s trying to gradient hack), the decisionmaker gets smacked by the gradient update because it’s responsible for the output.
I’m not quite ready to say it’s impossible to pull off this sort of attack within a single differentiable model under backprop, but this strategy doesn’t seem sufficient because the decisionmaker would basically be killing itself.
It’d be really helpful if someone could demonstrate this sort of attack at any scale or level of contrivance. I could see how optimizers that lack the robust credit assignment of backprop could be more vulnerable; that might be a good place to poke to try to demonstrate it first.
Pretty much agreed. I’m optimistic about conditioning being extremely potent such that AC doesn’t seem notably more concerning than any other form of eliciting safe capability.
I would probably use a more general kind of argument than the ones presented, though. I don’t disagree with it, but it (along with some other arguments about the potential weaknesses of conditioning) seems to start from a frame that puts something like the “agency” of predictive models a wee bit to the left of where I think it is.
Specifically, a predictor is the process which narrows an output distribution based on input conditions. It may indeed learn that certain kinds of inputs naturally map to a narrowed distribution corresponding to various kinds of Anthropic Weirdness.
However, suppose there exists a nonempty subset of worlds consistent with the inputs that do not involve Anthropic Weirdness. So long as that exists, there exists a narrowing function—a condition—which prunes out the unwanted Anthropically Weird explanations. (Or selects for them, or selects for indifference about them...)
So long as the predictor is a faithful implementation of an otherwise goal agnostic update procedure, conditioning is definitely sufficient in principle. It’s exactly the same kind of question as figuring out how to condition for any other output.
In practice, in order to condition for this behavior, we need a lever (or realistically, a set of many interacting levers) we can use to find the distribution we want. These levers look like teaching a model distinctions that we can then use as conditions. For example, here is one possible recipe:
1. Take a training set that contains at least some examples of anthropically weird prediction sequences and some normal sequences.
2. Given different initial prompts, score completions according to their degree of anthropic weirdness (along the axis of weirdness we want our condition to care about).
3. Include that score as a condition (another token) alongside the initial prompt, and train the model to output the classified sequence in that context.
In order for the model to predict sequences following the score token, it must learn the distinction that the score token represents. Given a capable model and a sufficient dataset, I expect it to do pretty well.
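As a rough sketch of what the data construction could look like (the metatoken format and the scorer below are made up for illustration; a real version needs an actual classifier or labeling process for the chosen weirdness axis):

```python
WEIRD_MARKERS = ("I am being simulated", "the predictor itself")  # toy stand-ins

def weirdness_score(completion: str) -> int:
    """Toy stand-in for a real classifier along the chosen weirdness axis (0-9)."""
    return min(9, 5 * sum(marker in completion for marker in WEIRD_MARKERS))

def build_conditioned_example(prompt: str, completion: str) -> str:
    score = weirdness_score(completion)
    # Prepend the score as a metatoken-style condition: to predict the completion
    # that follows it, the model has to learn the distinction the score represents.
    return f"<weirdness={score}>{prompt}{completion}"

print(build_conditioned_example("Q: What are you?\nA:", " Just a helpful assistant."))
# At inference time, prepending <weirdness=0> conditions on the learned distinction.
```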
Obviously, there’s a lot more ways to actually implement this kind of process. Further, these kinds of conditions can be intermixed and leverage each other. The stronger the model gets, the stronger these distinctions can be.
Combined with high level interpretability that lets us investigate how conditions are influencing predictions to see if we are targeting the right distinction, I think we’ve got a very solid path. At this point, the majority of my P(doom) lives elsewhere.
Edited, since I left it a bit ambiguous/unclear: there’s an objection that a model could remain anthropically captured because a condition could (for example) have been observed in the simulated universe, so the prediction still ends up being captured. That objection doesn’t work as well when the conditioning is of a type that comes first in some sense. The conditioning scheme I outline later in the post lives on a subtly different level with respect to the model than most sequence information; the model does not observe that type of condition in the world, it is a metatoken.
Yup, there’s a bunch of overlap there. The main difference (as far as I can tell from the outside, anyway) is that I suspect I’m more optimistic about safely using much stronger/more general chunks within a CoEm-ish system, and my day-to-day effort is half coming up with ways to make that possible and half trying to prove myself wrong.
And I have considered it, but I’m still unsure if I should actually be full-timing AI safety. I might end up reaching out anyway with a more limited commitment since there do seem to be a lot of complementary options.
This would help explain some things I’ve been scratching my head about (revealed preferences implied by visible funding, assuming ample cash on hand, didn’t seem to match well with a talent-constrained story). I’m currently trying to figure out how to prioritize earning to give versus direct research; it would be really helpful to know more about the funding situation from the perspective of the grantmaking organizations.
In particular, it’d be great to have some best-guess projections of future funding distributions, and any other information that might help fill in different values of X, Y and Z for propositions like “We would prefer X dollars in extra funding received in year Y over Z additional weekly hours of researcher time starting today.”
This is a pet peeve of mine. I remember 20 years ago, a wee boyston getting into the semimathy parts of programming and seeing all the dense notation and thinking “this must speak to the inherent complexity of the problem and must be the most natural representation!”
I’ve become progressively more annoyed by it. I was reading a paper a week ago that enjoyed its notation a little too much; it took a while for me to realize what a particular equation was supposed to represent, despite the fact that I had implemented exactly what it represented from scratch before.
To be clear, I don’t mind having a single symbol that means a very specific thing by strong convention. But sometimes you’ll see ϕ_{eθ,0} and τ_wer and sixteen other symbols, and then you look for a lookup table and there isn’t one, and then you scan 8 paragraphs to find the definition of two of the symbols, and then you find the only reference to τ_wer, except it says “τ_wer is, by default, equivalent to τ_ke,” and then you rub your temples. Or the author develops a custom notation that is maybe internally consistent, maybe not, and builds a giant edifice on it.
It’s code golf, except there’s no compiler and the author couldn’t test it and it sometimes has errors. Enjoy, reader!
I don’t think lesswrong is unusually bad about this, but I’m now sufficiently allergic to it that seeing excess notation does make me suspicious.
For the sake of intuition, it’s useful to separate the capabilities visibly present in generated sequences from the capabilities of the model itself.
Suppose you’ve got an untuned language model trained on a bunch of human conversations, and you generate a billion rollouts of conversations from scratch (that is, no initial prompting or conditions on the input). This process won’t tend to output conversations between humans that have IQs of 400, because the training distribution does not contain those. The average simulated conversation will be, in many ways, close to the average conversation in the training set.
But it would be incorrect to say that the language model has an “IQ” of 100 (even assuming the humans in the training distribution averaged 100). The capability elicited from the language model depends on the conditions of its predictions. When prompted to produce a conversation between two mathematicians trying to puzzle something out, the result is going to be very different from the random sampling case.
You can come up with a decent guess about how smart a character the model plays is, because strong language models tend to be pretty consistent. In contrast, it’s very hard to know how smart a language model is, because its externally visible behavior is only ever a lower bound on its capability. The language model is not its characters; it is the thing that can play any of its characters.
Next, keep in mind that even simple autoregressive token prediction can be arbitrarily hard. A common example is reversing a hash. Consider prompting a language model with: “0xDF810AF8 is the truncated SHA256 hash of the string”
It does not take superhuman intelligence to write that prompt, but if a language model were able to complete that prompt correctly, it would imply really weird things.
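For concreteness: the forward direction is trivial, and it’s the inversion that would be absurd. (The hash in the prompt is just an example value, not a claim about any particular preimage.)

```python
import hashlib

def truncated_sha256(s: str) -> str:
    """First 8 hex digits (32 bits) of the SHA256 of a string."""
    return "0x" + hashlib.sha256(s.encode()).hexdigest()[:8].upper()

print(truncated_sha256("hello world"))  # easy in this direction, for any string
# Completing the prompt correctly means producing a string whose truncated hash
# is 0xDF810AF8, i.e. searching for or inverting the hash.
```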
That’s an extreme case, but it’s not unique. For a closer example, try an experiment:
Try writing a program, at least 25 lines of nontrivial code, starting with a blank file, without using any pen and paper or other supporting tools, without editing anything you write, without using backspace. Just start typing characters in sequence and stop when the program’s done. And of course, the program should compile and work correctly.
Then try asking ChatGPT4 to do it. See who gets it done faster, and how many tries it takes!
The choice of which token to output next for this kind of programming task depends on a deep mental model of what comes next, and every character typed constrains your options in the future. Some mistakes are instantly fatal and stop your program from compiling. Others may be fixable by modifying future predictions to compensate, but every deviation adds complexity and constraints. GPT4 frequently sprints straight through all of it.
The key is that even GPT-3 is already superhuman at the thing it’s actually doing. It’s the thing that’s shaping output distributions for input conditions, not the thing most directly “having a conversation” or whatever else.
3 more posts I feel like I need to write at some point:
1. Solving all of ethics and morality and getting an AI to implement it seems hard. There are possible worlds where we would need to work with half measures. Some of these paths rely on lower auto-doom densities, but there seem to be enough of those potential worlds to consider it. An example of ‘good enough to not x/s-risk’ dumb value alignment; the required assumptions for stability; the shape of questions implied that may differ from more complete solutions.
2. Make a bunch of diagrams of things I believe relevant to alignment stuff and how they interact, plus the implications of those things. The real point of the post is to encourage people to try to make more explicit and extremely legible models so people can actually figure out where they disagree instead of running around in loops for several years.
3. Generalizing the principle from policy regularization:
   - Adversaries need not be actual agents working against you.
   - “Sharp” models that aggressively exploit specific features have a fragile dependence on those features. Such models are themselves exploitable.
   - Uncertainty and chaos are strong regularizers. The amount of capability required to overcome even relatively small quantities of chaos can be extreme.
   - Applications in prediction.
> Hm, if we apply the maximum entropy principle universally, aren’t we also obliged to apply it reflectively, i.e., model oneself as a maximum-entropy (active inference) agent?
If you precisely define what it means to apply it “universally” such that it gets you the desired behavior, sure. And to be clear, I’m not saying that’s a hard/impossible problem or anything like that, it’s just not directly implied by all things which match the description “follows the principle of maximum entropy.”
> Looks more like a suitable inductive bias and/or bias is needed rather than causal surgery.
If you were actually trying to implement this, yes, I wouldn’t recommend routing through weird counterfactuals. (I just bring those up as a way of describing the target behavior.)
In fact, because even the version I outlined in the added footnote can still suffer from collapse in the case of convergent acausal strategies across possible predictors, I would indeed strongly recommend pushing for some additional bias that gives you more control over how the distribution looks. I think that’s pretty tractable, too.
> Also, this actually doesn’t depend on the specific training procedure of auto-regressive LLMs, namely, backpropagation with token-by-token cross-entropy loss.
I do have some concerns about how far you can push the wider class of “predictors” in some directions before the process starts selecting for generalizing instrumental behaviors, but there isn’t a fundamental uniqueness about autoregressive NLL-backpropped prediction.
> It seems to me that “high instrumentality” is an instance of what I call an agent having an alien world model.
Possibly? I can’t tell if I endorse all the possible interpretations. When I say high instrumentality, I tend to focus on 1. the model is strongly incentivized to learn internally-motivated instrumental behavior (e.g. because the reward it was trained on is extremely sparse, and so the model must have learned some internal structure encouraging intermediate instrumental behavior useful during training), and 2. those internal motivations are less constrained and may occupy a wider space of weird options.
#2 may overlap with the kind of alienness you mean, but I’m not sure I would focus on the alienness of the world model instead of learned values (in the context of how I think about “high instrumentality” models, anyway).
> This is confusing because “reward functions” in RL and utilities in decision theory (or moral philosophy) apply to world states or outcomes, not plans.
While they are usually described in the context of world states and outcomes, I don’t think there is something special about the distinction. Or to phrase it another way: an embedded agent that views itself as a part of the world can consider its own behavior as a part of world state that it can have valid preferences about.
The most direct link between traditional RL and this concept is reward shaping. Very frequently, defining a sparse and distant goal prevents effective training. To compensate for this, the reward function is modified to include incremental signals that are easier to reach. For locomotion, this might look like “reward velocities that are positive along the X axis,” while the original reward might have just been “reach position.X >= 100.”
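As a minimal sketch of that contrast (the field names are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class State:
    position_x: float
    velocity_x: float

def sparse_reward(state: State) -> float:
    # Original objective: reward only the final outcome.
    return 1.0 if state.position_x >= 100.0 else 0.0

def shaped_reward(state: State) -> float:
    # Shaped objective: also reward incremental progress at every step.
    progress_bonus = 0.01 * max(state.velocity_x, 0.0)
    return sparse_reward(state) + progress_bonus

print(sparse_reward(State(3.0, 1.2)), shaped_reward(State(3.0, 1.2)))  # 0.0 0.012
```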
Reward shaping can be pushed arbitrarily far. You could implement imitation in the reward function: no longer is the reward just about an outcome, but also about how that outcome comes about. (Or to phrase it the other way again- the how becomes an outcome itself.)
In the limit, the reward function can be made extremely dense such that every possible output is associated with informative reward shaping. You can specify a reward function that, when sampled with traditional RL, reconstructs gradients similar to those of a predictive loss. I’m trying to get at the idea that there isn’t a fundamental difference in kind.
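A toy, hedged illustration of that direction: give a reward at every token for matching a reference sequence, and a REINFORCE-style estimator on that dense reward pushes probability toward the reference tokens, much like (a sampled, noisier version of) an ordinary predictive loss would:

```python
import torch

def reinforce_surrogate(logits: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size); reference: (seq_len,) target token ids
    log_probs = torch.log_softmax(logits, dim=-1)
    sampled = torch.multinomial(log_probs.exp(), num_samples=1).squeeze(-1)
    reward = (sampled == reference).float()  # dense shaping: one reward signal per token
    # REINFORCE surrogate: reinforce each sampled token in proportion to its reward,
    # i.e. push up exactly the sampled tokens that matched the reference.
    return -(reward * log_probs[torch.arange(len(sampled)), sampled]).sum()

logits = torch.randn(5, 10, requires_grad=True)
reference = torch.randint(0, 10, (5,))
reinforce_surrogate(logits, reference).backward()  # gradient resembles a sampled cross-entropy
```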
A big part of what I’m trying to do with these posts is to connect predictors/simulators to existing frameworks (like utility and reward). If one of these other frameworks (which tend to have a lot of strength where they apply) suggested something bad about predictors with respect to safety efforts, it would be important to know.
> I maybe understood only 30% of this post. If this post was intended for the public, I either recommend editing it (if you want I can collaborate with you on a draft to sort out all the places which I didn’t understand), or adding a disclaimer about the prerequisite reading required to understand it (the “Simulators” post is not sufficient, I’ve read it).
That could be helpful. I’m pretty clearly suffering from some illusion of transparency here; I can’t easily predict the direction of the confusion.
The most related posts to read for background would be Implied “utilities” of simulators are broad, dense, and shallow (which I see you’ve now read) and Instrumentality makes agents agenty. There’s also a much bigger post I submitted to the AI alignment awards that goes into more depth, but I haven’t gotten around to publishing that quite yet.
There’s also Simulators, constraints, and goal agnosticism: porbynotes vol. 1, but that’s some much earlier thoughts on the topic, including some of which I don’t entirely endorse anymore, and it is explicitly braindumping that’s not optimized for any particular audience. And it’s really long.
> Looks like this simulator/agent should adhere to the principle of maximum entropy, I think it’s worthwhile spelling this out.
I avoided this for now because I can’t point to exactly how maximum entropy is sufficient for what I intend by “minimally collapsed.”
Naively selecting from the maximum entropy distribution (as narrowed by all the conditions the predictor is aware of) still permits the model to collapse reflective predictions in a way that permits internally motivated goal-directed behavior (leaving aside whether it’s probable), because it’s aware of the reflective nature of the prediction.
In other words, to get to what I mean by “minimally collapsed,” there seems to be some additional counterfactual surgery required. For example, the model could output the distribution that it would output if it knew it did not influence the prediction. Something like the predictor punting the prediction to a counterfactual version of itself that then predicts the original predictor’s output, assuming the predictor behaves like a strictly CDT agent. I think this has the right shape (edit: okay pretty sure that’s wrong now, more thinky required), but it’s pretty contorted.
All that said, I will add an extra footnote.