I intended the three to be probability, utility, and Steam, but it might make more sense to categorize things in other ways. While I still think there might be something more interesting here, I nowadays mainly think of Steam as the probability distribution over future actions and action-related concepts. This makes Steam an epistemic object, like any other belief, but with more normative/instrumental content, because it’s beliefs about actions and because there will be a lot of FixDT stuff going on in such beliefs. Kickstarter / “belief-in” dynamics also seem extremely relevant.
Here are some different things that come to mind.
As you mention, the simulacrum behaves in an agentic way within its simulated environment, as a character in a story. So the capacity to emulate agency is there. Sometimes characters can develop awareness that they are characters in a story. If an LLM is simulating that scenario, doesn’t it seem appropriate (at least on some level) to say that there is real agency being oriented toward the real world? This is “situational awareness”.
Another idea is that the LLM has to learn some strategic planning in order to direct its cognitive resources efficiently toward the task of prediction. Prediction is a very complicated task, so this meta-cognition could in principle become arbitrarily complicated. In principle we might expect this to converge toward some sort of consequentialist reasoning, because that sort of reasoning is generically useful for approaching complex domains. The goals of this consequentialist reasoning do not need to be exactly “predict accurately” however; they merely need to be adequately aligned with this in the training distribution.
Combining #1 and #2, if the model gets some use out of developing consequentialist metacognition, and the pseudo-consequentialist model used to simulate characters in stories is “right there”, the model might borrow it for metacognitive purposes.
The frame I tend to think about it with is not exactly “how does it develop agency” but rather “how is agency ruled out”. Although NNs don’t neatly separate into different hypotheses (eg, circuits can work together rather than just compete with each other) it is still roughly right to think of NN training as rejecting lots of hypotheses and keeping around lots of other hypotheses. Some of these hypotheses will be highly agentic; we know NNs are capable of arriving at highly agentic policies in specific cases. So there’s a question of whether those hypotheses can be ruled out in other cases. And then there’s the more empirical question of, if we haven’t entirely ruled out those agentic hypotheses, what degree of influence do they realistically have?
Seemingly the training data cannot entirely rule out an agentic style of reasoning (such as deceptive alignment), since agents can just choose to behave like non-agents. So, the inner alignment problem becomes: what other means can we use to rule out a large agentic influence? (Eg, can we argue that simplicity prior favors “honest” predictive models over deceptively aligned agents temporarily playing along with the prediction game?) The general concern is: no one has yet articulated a convincing answer, so far as I know.
Hence, I regard the problem more as a lack of any argument ruling out agency, rather than the existence of a clear positive argument that agency will arise. Others may have different views on this.
I am thinking of this as a noise-reducing modification to the loss function, similar to using model-based rather than model-free learning (which, if done well, rewards/punishes a policy based on the average reward/punishment it would have gotten over many steps).
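As a toy illustration of the noise-reduction point (purely illustrative; the setup and numbers are made up, not from the post): a single sampled outcome is an unbiased but high-variance signal of a policy's value, while averaging over many simulated rollouts keeps the expectation and shrinks the variance roughly in proportion to the number of rollouts.

```python
import random

def rollout_reward(policy_quality):
    # One noisy sample of the reward this policy would get
    # (toy model: true value plus zero-mean noise).
    return policy_quality + random.gauss(0, 1.0)

def model_based_estimate(policy_quality, n_rollouts=100):
    # Average many simulated rollouts: same expected value, ~1/n the variance.
    return sum(rollout_reward(policy_quality) for _ in range(n_rollouts)) / n_rollouts

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(0)
true_value = 0.3
single = [rollout_reward(true_value) for _ in range(1000)]
averaged = [model_based_estimate(true_value) for _ in range(1000)]

print("variance of single-rollout estimates:", variance(single))
print("variance of averaged estimates:      ", variance(averaged))
```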
If science were incentivized via prediction market (and assuming scientists can make sizable bets by taking out loans), then the first person to predict a thing wins most of the money related to it. In other words, prediction markets are approximately parade-leader-incentivizing.
But if there’s a race to be the first to bet, then this reward is high-variance; Newton could get priority over Leibniz by getting his ideas to the market a little faster.
You recommend dividing credit more to all the people who could have gotten information to the market, with some kind of time-discount for when they could have done it. If we conceive of “who won the race” as introducing some noise into the credit-assignment, this is a way to de-noise things.
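For concreteness, here is one hypothetical way to cash out "divide credit among everyone who could have gotten the information to the market, discounted by how soon they could have done it". The exponential discount and the whole scheme are my own illustrative guesses, not something taken from the post:

```python
def credit_shares(could_have_published_at, discount=0.9):
    """Split one unit of credit among contributors.

    could_have_published_at: dict mapping person -> earliest time (e.g. year)
    they could have gotten the result to the 'market'. Earlier times get
    exponentially more weight, but everyone who could have done it gets a share.
    """
    earliest = min(could_have_published_at.values())
    weights = {
        person: discount ** (t - earliest)
        for person, t in could_have_published_at.items()
    }
    total = sum(weights.values())
    return {person: w / total for person, w in weights.items()}

# Newton vs Leibniz, plus a hypothetical third person a decade behind:
print(credit_shares({"Newton": 1666, "Leibniz": 1674, "Other": 1680}))
```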
This has the consequence of taking away a lot of credit from race-winners when the race was pretty big, which is the part you focus on; based on this idea, you want to be part of smaller races (ideally size 1). But, outside-view, you should have wanted this all along anyway; if you are racing for status, but you are part of a big race, only a small number of people can win anyway, so your outside-view probability of personally winning status should already be divided by the number of racers. To think you have a good chance of winning such a race you must have personal reasons, and (since being in the race selects, in part, for people who think they can win) they’re probably overconfident.
So for the most part your advice has no benefit for calibrated people, since being a parade-leader is hard.
There are for sure cases where your metric comes apart from expected-parade-leading by a lot more, though. A few years ago I heard accusations that one of the big names behind deep learning earned their status by visiting lots of research groups, keeping an eye out for what big things were going to happen next, and managing to publish papers on these big things just a bit ahead of everyone else. This strategy creates the appearance of being a fountain of information, when in fact the service provided is just a small speed boost to pre-existing trends. (I do not recall who exactly was being accused, and I don’t have much information on the reliability of this assessment anyway; it was just a rumor.)
Typically, people say that the market is mostly efficient, and if there was financial alpha to be gained by doing hiring differently from most corporations, then there would already be companies outcompeting others by doing that. Well, here’s a company doing some things differently and outcompeting other companies. Maybe there aren’t enough people willing to do such things (who have the resources to) for the returns to reach an equilibrium?
Well, it could be that the practices lead to high-variance results, so that you should mostly expect companies which operate like that to fail, but you also expect a few unusually large wins.
But I’m not familiar enough with the specific case to say anything substantial.
I am not sure whether I am more excited about ‘positive’ approaches (accelerating alignment research more) vs ‘negative’ approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more/less dangerous than others, and the most clearly risky stuff right now is scaling & scaling-related.
So you agree with the claim that current LLMs are a lot more useful for accelerating capabilities work than they are for accelerating alignment work?
Hmm. Have you tried to have conversations with Claude or other LLMs for the purpose of alignment work? If so, what happened?
For me, what happens is that Claude tries to work constitutional AI in as the solution to most problems. This is part of what I mean by “bad at philosophy”.
But more generally, I have a sense that I just get BS from Claude, even when it isn’t specifically trying to shoehorn its own safety measures in as the solution.
Any thoughts on the sort of failure mode suggested by AI doing philosophy = AI generating hands? I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy. It also seems easier in principle to train LLMs to be even better at programming. There’s also going to be a lot more of a direct market incentive for LLMs to keep getting better at programming.
(Helping out with programming is also not the only way LLMs can help accelerate capabilities.)
So this seems like a generally dangerous overall dynamic—LLMs are already better at accelerating capabilities progress than they are at accelerating alignment, and furthermore, it seems like the strong default is for this disparity to get worse and worse.
I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.
Thanks!
I’ll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it’s still impossible to solve the problem, but it’s usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.
I mean, yeah, I agree with all of this as generic statements if we ignore the subject at hand.
I agree it isn’t a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won’t have this failure mode, so I’m still quite comfortable with using it as an implication that isn’t 100% accurate, but more like 90-95+% accurate.
I agree the example sucks and only serves to prove that it is not a logical implication.
A better example would be, say, the Goodhart model of AI risk, where any loss function that we optimize hard enough to get to superintelligence would probably result in a large divergence between what we get and what we actually want, because optimization amplifies. Note that this still does not assume that we need to prove 100% safety; rather, it argues, from stated assumptions, that it will be hard to get any safety at all from loss functions which merely coincide somewhat well with what we want.
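A toy simulation of the "optimization amplifies" point (my own illustrative setup, not anything from the original argument): if the proxy is the true value plus independent noise, then the harder you select on the proxy, the more the selected option's proxy score outruns its true value.

```python
import random

random.seed(0)

def goodhart_gap(n_options, trials=2000):
    """Average amount by which the proxy score of the proxy-best option
    overstates its true value, when selecting among n_options candidates."""
    total_gap = 0.0
    for _ in range(trials):
        # Each option has a true value and a proxy = true value + noise.
        options = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n_options)]
        true_value, noise = max(options, key=lambda o: o[0] + o[1])
        total_gap += noise  # proxy minus true value of the selected option
    return total_gap / trials

for n in (1, 10, 100, 1000):
    print(f"selecting among {n:4d} options: proxy overstates true value by ~{goodhart_gap(n):.2f}")
```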
I still think the list of lethalities is a pretty good reply to your overall line of reasoning—IE it clearly flags that the problem is not achieving perfection, but rather, achieving any significant probability of safety, and it gives a bunch of concrete reasons why this is hard, IE provides arguments rather than some kind of blind assumption like you seem to be indicating.
You are doing a reasonable thing by trying to provide some sort of argument for why these conclusions seem wrong, but “things tend to be easy when you lift the requirement of perfection” is just an extremely weak argument which seems to fall apart the moment we contemplate the specific case of AI alignment at all.
I finally got around to reading this today, because I have been thinking about doing more interpretability work, so I wanted to give this piece a chance to talk me out of it.
It mostly didn’t.
A lot of this boils down to “existing interpretability work is unimpressive”. I think this is an important point, and significant sub-points were raised to argue it. However, it says little ‘against almost every theory of impact of interpretability’. We can just do better work.
A lot of the rest boils down to “enumerative safety is dumb”. I agree, at least for the version of “enumerative safety” you argue against here.
My impact story (for the work I am considering doing) is most similar to the “retargeting” story which you briefly mention, but barely critique.
I do think the world would be better off if this were required reading for anyone considering going into interpretability vs other areas. (Barring weird side-effects of the counterfactual where someone has the ability to enforce required reading...) It is a good piece of work which raises many important points.
More generally, if we grant that we don’t need perfection, or arbitrarily good alignment, at least early on, then I think this implies that alignment should be really easy, and the p(Doom) numbers are almost certainly way too high, primarily because it’s often doable to solve problems if you don’t need perfect or arbitrarily good solutions.
It seems really easy to spell out worldviews where “we don’t need perfection, or arbitrarily good alignment” holds but “alignment should be really easy” fails. To give a somewhat silly example based on the OP, I could buy Enumerative Safety in principle, so that if we can check all the features for safety, we can 100% guarantee the safety of the model. It then follows that if we can check 95% of the features (sampled randomly) then we get something like a 95% safety guarantee (depending on priors).
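To show what "(depending on priors)" is doing there, a deliberately toy calculation, with all the assumptions mine: an unsafe model hides exactly one bad feature, placed uniformly at random, and checks never err.

```python
def posterior_safe(prior_unsafe, n_features, n_checked):
    """P(model is safe | checked n_checked random features, found nothing bad),
    assuming an unsafe model hides exactly one bad feature uniformly at random
    and checks are perfect."""
    p_miss = (n_features - n_checked) / n_features   # bad feature escapes the sample
    p_safe = 1 - prior_unsafe
    return p_safe / (p_safe + prior_unsafe * p_miss)

# Checking 95% of 10,000 features, under a 50% prior that the model is unsafe:
print(posterior_safe(prior_unsafe=0.5, n_features=10_000, n_checked=9_500))   # ~0.95
# With a much more pessimistic prior, the same coverage buys far less:
print(posterior_safe(prior_unsafe=0.99, n_features=10_000, n_checked=9_500))  # ~0.17
```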
But I might also think that properly “checking” even one feature is really, really hard.
So I don’t buy the claimed implication: “we don’t need perfection” does not imply “alignment should be really easy”. Indeed, I think the implication quite badly fails.
Compare this to a similar argument that a hardware enthusiast could use to argue against making a software/hardware distinction. You can argue that saying “software” is misleading because it distracts from the physical reality. Software is still present physically somewhere in the computer. Software doesn’t do anything hardware can’t do, since software doing is just hardware doing.
But thinking in this way will not be a very good way of predicting reality. The hypothetical hardware enthusiast would not be able to predict the rise of the “programmer” profession, or the great increase in complexity of things that machines can do thanks to “programming”.
I think it is more helpful to think of modern AI as a paradigm shift in the same way that the shift from “electronic” (hardware) to “digital” (software) was a paradigm shift. Sure, you can still use the old paradigm to put labels on things. Everything is “still hardware”. But doing so can miss an important transition.
While I agree that wedding photos and NN weights are both data, and this helps to highlight ways they “aren’t software”, I think this undersells the point. NN weights are “active” in ways wedding photos aren’t. The classic code/data distinction has a mostly-OK summary: code is data of type function. Code is data which can be “run” on other data.
NN weights are “of type function” too: the usual way to use them is to “run” them. Yet, it is pretty obvious that they are not code in the traditional sense.
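A crude way to put the "data of type function" point in code (purely illustrative; the three weights below are made up, not from any real model):

```python
# Code: a function a programmer wrote down explicitly.
def brightness(pixel):
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

# NN weights: also "of type function" -- numbers you run on other data --
# but nobody wrote them; they were fit by training.
weights = [0.31, 0.58, 0.12]  # imagine these came out of an optimizer
def learned_brightness(pixel):
    return sum(w * x for w, x in zip(weights, pixel))

print(brightness((200, 100, 50)))
print(learned_brightness((200, 100, 50)))
```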
So I think this is similar to a hardware geek insisting that code is just hardware configuration, like setting a dial or flipping a set of switches. To the hypothetical hardware geek, everything is hardware; “software” is a physical thing just as much as a wire is. An Arduino is just a particularly inefficient control circuit.
So, although from a hardware perspective you basically always want to replace an Arduino with a more special-purpose chip, “something magical” happens when we move to software—new sorts of things become possible.
Similarly, looking at AI as data rather than code may be a way to say that AI “isn’t software” within the paradigm of software, but it is not very helpful for understanding the large shift that is taking place. I think it is better to see this as a new layer in somewhat the same way as software was a new layer on top of hardware. The kinds of thinking you need to do in order to do something with hardware vs do something with software are quite different, but ultimately, more similar to each other than they both are to how to do something with AI.
Ah, very interesting, thanks! I wonder if there is a different way to measure relative endorsement that could achieve transitivity.
Yeah, the stuff in the updatelessness section was supposed to gesture at how to handle this with my definition.
First of all, I think children surprise me enough in pursuit of their own goals that they do often count as agents by the definition in the post.
But, if children or animals who are intuitively agents often don’t fit the definition in the post, my idea is that you can detect their agency by looking at things with increasingly time/space/data bounded probability distributions. I think taking on “smaller” perspectives is very important.
I can feel what you mean about arbitrarily drawing a circle around the known optimizer and then “deleting” it, but this just doesn’t feel that weird to me? Like I think the way that people model the world allows them to do this kind of operation with pretty substantially meaningful results.
I agree, but I am skeptical that there could be a satisfying mathematical notion here. And I am particularly skeptical about a satisfying mathematical notion that doesn’t already rely on some other agent-detector piece which helps us understand how to remove the agent.
I think this is where Flint’s framework was insightful. Instead of “detecting” and “deleting” the optimization process and then measuring the diff, you consider the system of every possible trajectory, measure the optimization of each (with respect to the ordering over states), take the average, and then compare your potential optimizer to this.
Looking back at Flint’s work, I don’t agree with this summary. His idea is more about spotting attractor basins in the dynamics. There is no “compare your optimizer to this” step which I can see, since he studies the dynamics of the entire system. He suggests that in cases where it is meaningful to make an optimizer/optimized distinction, this could be detected by noticing that a specific region (the ‘optimizer’) is sensitive to very small perturbations, which can take the whole system out of the attractor basin.
In any case, I agree that Flint’s work also eliminates the need for an unnatural baseline in which we have to remove the agent.
Overall, I expect my definition to be more useful to alignment, but I don’t currently have a well-articulated argument for that conclusion. Here are some comparison points:
Flint’s definition requires a system with stable dynamics over time, so that we can define an iteration rule. My definition can handle that case, but does not require it. So, for example, Flint’s definition doesn’t work well for a goal like “become President in 2030”—it works better for continual goals, like “be president”.
Flint’s notion of robustness involves counterfactual perturbations which we may never see in the real world. I feel a bit suspicious about this aspect. Can counterfactual perturbations we’ll never see in practice be really relevant and useful for reasoning about alignment?
Flint’s notion is based more on the physical system, whereas mine is more about how we subjectively view that system.
I feel that “endorsement” comes closer to a concept of alignment. Because of the subjective nature of endorsement, it comes closer to formalizing when an optimizer is trusted, rather than merely good at its job.
It seems more plausible that we can show (with plausible normative assumptions about our own reasoning) that we (should) absolutely endorse some AI, in comparison to modeling the world in sufficient detail to show that building the AI would put us into a good attractor basin.
I suspect Flint’s definition suffers more from the value change problem than mine, although I think I haven’t done the work necessary to make this clear.
There are several compromises I made for the sake of getting the idea across as simply as I could.
I think the graduate-level-textbook version of this would be much more clear about what the quotes are doing. I was tempted to not even include the quotes in the mathematical expressions, since I don’t think I’m super clear about why they’re there.
I totally ignored the difference between P(X|e) (probability conditional on e) and P_e(X) (probability after learning e).
I neglect to include quantifiers in any of my definitions; the reader is left to guess which things are implicitly universally quantified.
I think I do prefer the version I wrote, which uses P(X|e) rather than P_e(X), but obviously the English-language descriptions ignore this distinction and make it sound like what I really want is P_e(X).
It seems like the intention is that P_1 “learns” or “hears about” P_2’s belief, and then updates (in the above Bayesian inference sense) to have a new P_1 that has the consistency condition with P_2.
Obviously we can consider both possibilities and see where that goes, but I think maybe the conditional version makes more sense as a notion of whether you right now endorse something. A conditional probability is sort of like a plan for updating. You won’t necessarily follow the plan exactly when you actually update, but the conditional probability is your best estimate.
To throw some terminology out there, let’s call my thing “endorsement” and a version which uses actual updates rather than conditionals “deference” (because you’d actually defer to their opinions if you learn them).
You can know whether you endorse something, since you can know your current conditional probabilities (to within some accuracy, anyway). It is harder to know whether you defer to something, since in the case where updates don’t equal conditionals, you must not know what you are going to update to. I think it makes more sense to define the intentional stance in terms of something you can more easily know about yourself.
Using endorsement to define agency makes it about how you reason about specific hypotheticals, whereas using deference to try and define agency would make it about what actually happens in those hypotheticals (ie, how you would actually update if you learned a thing). Since you might not ever get to learn that thing, this makes endorsement more well-defined than deference.
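Spelling out the contrast (roughly, eliding the quoting machinery; here B is the belief function being evaluated and P_e is the distribution I actually move to after learning e):

```latex
\text{Endorsement (conditional): } \quad
  P\!\left(X \,\middle|\, \ulcorner B(X)=p \urcorner\right) = p
  \quad \text{for all } p
\qquad
\text{Deference (update): } \quad
  P_{\ulcorner B(X)=p \urcorner}(X) = p
  \quad \text{for all } p
```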
Bayes’ theorem is the statement P(A|B) = P(B|A)P(A)/P(B), which follows from the axioms of probability theory for any A and B whatsoever.
I actually prefer the view of Alan Hajek (among others) who holds that P(A|B) is a primitive, not defined as in Bayes’ ratio formula for conditional probability. Bayes’ ratio formula can be proven in the case where P(B)>0, but if P(B)=0 it seems better to say that conditional probabilities can exist rather than necessarily being undefined. For example, we can reason about the conditional probability that a meteor hits land given that it hits the equator, even if hitting the equator is a measure zero event. Statisticians learn to compute such things in advanced stats classes, and it seems sensible to unify such notions under the formal P(A|B) rather than insisting that they are technically some other thing.
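For concreteness, the contrast I have in mind (standard material, stated loosely): the ratio formula is a theorem whenever the condition has positive probability, and the measure-zero case can be handled by, e.g., taking limits over shrinking approximations of the condition (with the usual Borel–Kolmogorov caveat that the answer depends on the approximating family):

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{whenever } P(B) > 0,
\qquad
P(\text{land} \mid \text{equator})
  = \lim_{\varepsilon \to 0} \frac{P(\text{land} \cap E_\varepsilon)}{P(E_\varepsilon)},
\quad E_\varepsilon = \text{an } \varepsilon\text{-band around the equator}
```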
By putting ⌜P_2(X) = p⌝ in the conditional, you’re saying that it’s an event on Ω, a thing with the same type as X. And it feels like that’s conceptually correct, but also kind of the hard part. It’s as if P_1 is modelling P_2 as an agent embedded into Ω.
Right. This is what I was gesturing at with the quotes. There has to be some kind of translation from P_2 (which is a mathematical concept “outside” Ω) to an event inside Ω. So the quotes are doing something similar to a Goedel encoding.
While trying to understand the equations, I found it easier to visualize P_1 and P_2 as two separate distributions on the same sample space Ω, where endorsement is simply a consistency condition. For belief consistency, you would just say that P_1 endorses P_2 on event X if P_1(X) = P_2(X).
But that isn’t what you wrote; instead you wrote this thing with conditioning on a quoted thing. And of course, the thing I said is symmetrical between P_1 and P_2, whereas your concept of endorsement is not symmetrical.
The asymmetry is quite important. If we could only endorse things that have exactly our opinions, we could never improve.
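Side by side, the point about asymmetry (again eliding the quoting details):

```latex
\text{Consistency (symmetric in } P_1, P_2\text{): } \quad P_1(X) = P_2(X)
\qquad
\text{Endorsement (asymmetric): } \quad
  P_1\!\left(X \,\middle|\, \ulcorner P_2(X) = p \urcorner\right) = p
```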
Yes, I think it is fair to say that I meant the Shoggoth part, although I’m a little wary of that dichotomy being utilized in a load-bearing way.
No room for agency at all? If this were well-reasoned, I would consider it major progress on the inner alignment problem. But I fail to follow your line of thinking. Something being architecturally an input-output function seems not that closely related to what kind of universe it “lives” in. Part of the lesson of transformer architectures, in my view at least, was that giving a next-token-predictor a long input context is more practical than trying to train RNNs. What this suggests is that given a long context window, LLMs reconstruct the information which would have been kept around in a recurrent state pretty well anyway.
This makes it not very plausible that the key dividing line between agentic and non-agentic is whether the architecture keeps state around.
The argument I sketched as to why this input-output function might learn to be agentic was that it is tackling an extremely complex task, which might benefit from some agentic strategy. I’m still not saying such an argument is correct, but perhaps it will help to sketch why this seems plausible. Modern LLMs are broadly thought of as “attention” algorithms, meaning they decide what parts of sequences to focus on. Separately, many people think it is reasonable to characterize modern LLMs as having a sort of world-model which gets consulted to recall facts. Where to focus attention is a consideration which will have lots of facets to it, of course. But in a multi-stage transformer, isn’t it plausible that the world-model gets consulted in a way that feeds into how attention is allocated? In other words, couldn’t attention-allocation go through a relatively consequentialist circuit at times, which essentially asks itself a question about how it expects things to go if it allocates attention in different ways?
Any specific repeated calculation of that kind could get “memorized out”, replaced with a shorter circuit which simply knows how to proceed in those circumstances. But it is possible, in theory at least, that the more general-purpose reasoning, going through the world-model, would be selected for due to its broad utility in a variety of circumstances.
Since the world-model-consultation is only selected to be useful for predicting the next token, the consequentialist question which the system asks its world-model could be fairly arbitrary so long as it has a good correlation with next-token-prediction utility on the training data.
Is this planning? IE does the “query to the world-model” involve considering multiple plans and rejecting worse ones? Or is the world-model more of a memorized mess of stuff with no “moving parts” to its computation? Well, we don’t really know enough to say (so far as I am aware). Input-output type signatures do not tell us much about the simplicity or complexity of calculations within. “It’s just circuits” but large circuits can implement some pretty sophisticated algorithms. Big NNs do not equal big lookup tables.
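For concreteness, this is the standard attention computation I’m gesturing at above (nothing here is specific to any particular model; it just shows that where attention goes is itself computed from learned representations, so in principle arbitrarily sophisticated considerations can feed into it):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over a token sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # how much each position attends to each other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: the "where to focus" decision
    return weights @ v

seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))                    # token representations (from earlier layers)
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))  # learned projections
print(attention(x, w_q, w_k, w_v).shape)             # (5, 8)
```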