# johnswentworth

Karma: 32,753
• 2 Dec 2022 18:50 UTC
LW: 6 AF: 4
0 ∶ 0
AF

Does this mean that you expect we will be able to build advanced AI that doesn’t become an expected utility maximizer?

When talking about whether some physical system “is a utility maximizer”, the key questions are “utility over what variables?”, “in what model do those variables live?”, and “with respect to what measuring stick?”. My guess is that a corrigible AI will be a utility maximizer over something, but maybe not over the AI-operator interface itself? I’m still highly uncertain what that type-signature will look like, but there’s a lot of degrees of freedom to work with.

Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods?

We’ll need qualitatively different methods. But that’s not new; interpretability researchers already come up with qualitatively new methods pretty regularly.

• 2 Dec 2022 18:40 UTC
2 points
0 ∶ 0
in reply to: sudo -i’s comment

Some general types of value obtained by taking theories across the theory-practice gap:

• Finding out where the theory is wrong

• Direct value from applying the theory

• Creating robust platforms upon which further tools can be developed

• 2 Dec 2022 18:35 UTC
LW: 3 AF: 3
0 ∶ 0
AF
in reply to: Thane Ruthenis’s comment

Has the FTX fiasco impacted your expectation of us-in-the-future having enough money=compute to do the latter?

Basically no.

I’d like to make a case that Do What I Mean will potentially turn out to be the better target than corrigibility/​value learning. …

I basically buy your argument, though there’s still the question of how safe a target DWIM is.

• 2 Dec 2022 18:32 UTC
LW: 18 AF: 7
0 ∶ 0
AF
in reply to: Thane Ruthenis’s comment

Still on the “figure out agency and train up an aligned AGI unilaterally” path?

“Train up an AGI unilaterally” doesn’t quite carve my plans at the joints.

One of the most common ways I see people fail to have any effect at all is to think in terms of “we”. They come up with plans which “we” could follow, for some “we” which is not in fact going to follow that plan. And then they take political-flavored actions which symbolically promote the plan, but are not in fact going to result in “we” implementing the plan. (And also, usually, the “we” in question is too dysfunctional as a group to implement the plan even if all the individuals wanted to, because that is how approximately 100% of organizations of more than 10 people operate.) In cognitive terms, the plan is pretending that lots of other people’s actions are choosable/controllable, when in fact those other people’s actions are not choosable/controllable, at least relative to the planner’s actual capabilities.

The simplest and most robust counter to this failure mode is to always make unilateral plans.

But to counter the failure mode, plans don’t need to be completely unilateral. They can involve other people doing things which those other people will actually predictably do. So, for instance, maybe I’ll write a paper about natural abstractions in hopes of nerd-sniping some complex systems theorists to further develop the theory. That’s fine; the actions which I need to counterfact over in order for that plan to work are actions which I can in fact take unilaterally (i.e. write a paper). Other than that, I’m just relying on other people acting in ways in which they’ll predictably act anyway.

Point is: in order for a plan to be a “real plan” (as opposed to e.g. a fabricated option, or a de-facto applause light), all of the actions which the plan treats as “under the planner’s control” must be actions which can be taken unilaterally. Any non-unilateral actions need to be things which we actually expect people to do by default, not things we wish they would do.

Coming back to the question: my plans certainly do not live in some children’s fantasy world where one or more major AI labs magically become the least-dysfunctional multiple-hundred-person organizations on the planet, and then we all build an aligned AGI via the magic of Friendship and Cooperation. The realistic assumption is that large organizations are mostly carried wherever the memetic waves drift. Now, the memetic waves may drift in a good direction—if e.g. the field of alignment does indeed converge to a paradigm around decoding the internal language of nets and expressing our targets in that language, then there’s a strong chance the major labs follow that tide, and do a lot of useful work. And I do unilaterally have nonzero ability to steer that memetic drift—for instance, by creating public knowledge of various useful lines of alignment research converging, or by training lots of competent people.

That’s the sort of non-unilaterality which I’m fine having in my plans: relying on other people to behave in realistic ways, conditional on me doing things which I can actually unilaterally do.

• 2 Dec 2022 17:52 UTC
LW: 20 AF: 11
2 ∶ 3
AF
in reply to: Thane Ruthenis’s comment

Any changes to your median timeline until AGI, i. e., do we actually have these 9-14 years?

Here’s a dump of my current timeline models. (I actually originally drafted this as part of the post, then cut it.)

My current intuition is that deep learning is approximately one transformer-level paradigm shift away from human-level AGI. (And, obviously, once we have human-level AGI things foom relatively quickly.) That comes from an intuitive extrapolation: if something came along which was as much better than the models of the last 2-3 years as those models are better than pre-transformer models, then I’d expect it to be at least human-level. That does not mean that nets will get to human level immediately after that transformer-level shift comes along; e.g. with transformers it still took ~2-3 years before transformer models really started to look impressive.

So the most important update from deep learning over the past year has been the lack of any transformer-level paradigm shift in algorithms, architectures, etc.

There are of course other potential paths to human-level (or higher) which don’t route through a transformer-level paradigm shift in deep learning. One obvious path is to just keep scaling; I expect we’ll see a paradigm shift well before scaling alone achieves human-level AGI (and this seems even more likely post-Chinchilla). The main other path is that somebody wires together a bunch of GPT-style AGIs in such a way that they achieve greater intelligence by talking to each other (sort of like how humans took off via cultural accumulation); I don’t think that’s very likely to happen near-term, but I do think it’s the main path by which 5-year timelines would happen without a paradigm shift. Call it maybe 5-10%. Finally, of course, there’s always the “unknown unknowns” possibility.

### How long until the next shift?

Back around 2014 or 2015, I was visiting my alma mater, and a professor asked me what I thought about the deep learning wave. I said it looked pretty much like all the previous ML/​AI hype cycles: everyone would be very excited for a while and make grand claims, but the algorithms would be super finicky and unreliable. Eventually the hype would die down, and we’d go into another AI winter. About ten years after the start of the wave someone would show that the method (in this case large CNNs) was equivalent to some Bayesian model, and then it would make sense when it did/​didn’t work, and it would join the standard toolbox of workhorse ML algorithms. Eventually some new paradigm would come along, and the hype cycle would start again.

… and in hindsight, I think that was basically correct up until transformers came along around 2017. Pre-transformer nets were indeed very finicky, and were indeed shown equivalent to some Bayesian model about ten years after the excitement started, at which point we had a much better idea of what they did and did not do well. The big difference from previous ML/​AI hype waves was that the next paradigm—transformers—came along before the previous wave had died out. We skipped an AI winter; the paradigm shift came in ~5 years rather than 10-15.

… and now it’s been about five years since transformers came along. Just naively extrapolating from the two most recent data points says it’s time for the next shift. And we haven’t seen that shift yet. (Yes, diffusion models came along, but those don’t seem likely to become a transformer-level paradigm shift; they don’t open up whole new classes of applications in the same way.)

So on the one hand, I’m definitely nervous that the next shift is imminent. On the other hand, it’s already very slightly on the late side, and if another 1-2 years go by I’ll update quite a bit toward that shift taking much longer.

Also, on an inside view, I expect the next shift to be quite a bit more difficult than the transformers shift. (I don’t plan to discuss the reasons for that, because spelling out exactly which technical hurdles need to be cleared in order to get nets to human level is exactly the sort of thing which potentially accelerates the shift.) That inside view is a big part of why my timelines last year were 10-15 years, and not 5. The other main reasons my timelines were 10-15 years were regression to the mean (i.e. the transformers paradigm shift came along very unusually quickly, and it was only one data point), general hype-wariness, and an intuitive sense that unknown unknowns in this case will tend to push toward longer timelines rather than shorter on net.

Put all that together, and there’s a big blob of probability mass on ~5 year timelines; call that 20-30% or so. But if we get through the next couple years without a transformer-level paradigm shift, and without a bunch of wired-together GPTs spontaneously taking off, then timelines get a fair bit longer, and that’s where my median world is.

# The Plan − 2022 Update

1 Dec 2022 20:43 UTC
153 points
• I’d add that everything in this post is still relevant even if the AGI in question isn’t explicitly modelling itself as being in a simulation, attempting to deceive human operators, etc. The more-general takeaway of the argument is that certain kinds of distribution shift will occur between training and deployment—e.g. a shift to a “large reality” (a universe which embeds the AI and has simple physics), etc. Those distribution shifts potentially make training behavior a bad proxy for deployment behavior, even in the absence of explicit malign intent of the AI toward its operators.

• One subtlety which I’d expect is relevant here: when two singular vectors have approximately the same singular value, the two vectors are very numerically unstable (within their span).

Suppose that two singular vectors have the same singular value $\sigma$. Then in the SVD, we have two terms of the form

$$\sigma u_1 v_1^T + \sigma u_2 v_2^T = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} \sigma & 0 \\ 0 & \sigma \end{pmatrix} \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix}$$

(where the $u$’s and $v$’s are column vectors). That middle part is just the shared singular value $\sigma$ times a 2x2 identity matrix:

$$\begin{pmatrix} \sigma & 0 \\ 0 & \sigma \end{pmatrix} = \sigma I$$

But the 2x2 identity matrix can be rewritten as a 2x2 rotation $R$ times its inverse $R^T$:

$$\sigma I = \sigma R R^T$$

… and then we can group $R$ and $R^T$ with $\begin{pmatrix} u_1 & u_2 \end{pmatrix}$ and $\begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix}$, respectively, to rotate the singular vectors:

$$\begin{pmatrix} u_1 & u_2 \end{pmatrix} \sigma R R^T \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix} = \sigma \left( \begin{pmatrix} u_1 & u_2 \end{pmatrix} R \right) \left( R^T \begin{pmatrix} v_1^T \\ v_2^T \end{pmatrix} \right)$$

Since the rotated $u$’s and $v$’s are still orthogonal, the end result is another valid singular value decomposition of the same matrix.

Upshot: when a singular value is repeated, the singular vectors are defined only up to a rotation (where the dimension of the rotation is the number of repeats of the singular value).

What this means practically/​conceptually is that, if two singular vectors have very close singular values, then a small amount of noise in the matrix will typically “mix them together”. So for instance, the post shows a plot of singular vectors for the OV matrix, and a whole bunch of the singular values are very close together. Conceptually, that means the corresponding singular vectors are all probably “mixed together” to a large extent. Insofar as they all have roughly-the-same singular value, the singular vectors themselves are underdefined/​unstable; what’s fully specified is the span of singular vectors with the same singular value.

(In fact, for the singular value distribution shown for the OV matrix in the post, nearly all the singular values are either approximately 10, or approximately 0. So that particular matrix is approximately a projection matrix, and the span of the singular vectors on either side gives the space projected from/​to.)
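A quick numerical sketch of this instability (a toy matrix of my own construction, not the OV matrix from the post): give a matrix a repeated singular value, perturb it slightly, and the individual singular vectors rotate by an O(1) amount while the span they define barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy matrix with a repeated top singular value of 10 (plus a couple of smaller ones).
n = 20
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
s = np.zeros(n)
s[:4] = [10.0, 10.0, 3.0, 1.0]
A = U @ np.diag(s) @ V.T

# Tiny perturbation.
A_noisy = A + 1e-3 * rng.normal(size=(n, n))

Ua, _, _ = np.linalg.svd(A)
Ub, _, _ = np.linalg.svd(A_noisy)

# Individual singular vectors within the degenerate subspace can rotate a lot...
print("vector overlap:", abs(Ua[:, 0] @ Ub[:, 0]))  # may be far from 1
# ...but the 2D span they define is stable (principal-angle cosines near 1).
print("span overlap:  ", np.linalg.svd(Ua[:, :2].T @ Ub[:, :2], compute_uv=False))
```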

• One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don’t imply that the plan fails...

I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn’t actually the main problem which needs to be solved here.

(Or perhaps you’re intentionally ignoring that problem by assuming “goal-alignment”?)

• First: what’s the load-bearing function of visualizations in math?

I think it’s the same function as prototypical examples more broadly. They serve as a consistency check—i.e. if there’s any example at all which matches the math then at least the math isn’t inconsistent. They also offer direct intuition for which of the assumptions are typically “slack” vs “taut”—i.e. in the context of the example, would the claim just totally fall apart if we relax a particular assumption, or would it gracefully degrade? And they give some intuition for what kinds-of-things to bind the mathematical symbols to, in order to apply the math.

Based on that, I’d expect that non-visual prototypical examples can often serve a similar role.

Also, some people use type-tracking to get some of the same benefits, though insofar as that’s a substitute for prototypical example tracking I think it’s usually inferior.

• I’ve been trying to push against the tendency for everyone to talk about FTX drama lately, but I have some generalizable points on the topic which I haven’t seen anybody else make, so here they are. (Be warned that I may just ignore responses; I don’t really want to dump energy into FTX drama.)

Summary: based on having worked in startups a fair bit, Sam Bankman-Fried’s description of what happened sounds probably accurate; I think he mostly wasn’t lying. I think other people do not really get the extent to which fast-growing companies are hectic and chaotic and full of sketchy quick-and-dirty workarounds and nobody has a comprehensive view of what’s going on.

Long version: at this point, the assumption/​consensus among most people I hear from seems to be that FTX committed intentional, outright fraud. And my current best guess is that that’s mostly false. (Maybe in the very last couple weeks before the collapse they toed the line into outright lies as a desperation measure, but even then I think they were in pretty grey territory.)

Key pieces of the story as I currently understand it:

• Moving money into/​out of crypto exchanges is a pain. At some point a quick-and-dirty solution was for customers to send money to Alameda (Sam Bankman-Fried’s crypto hedge fund), and then Alameda would credit them somehow on FTX.

• Customers did rather a lot of that. Like, $8B worth.

• The FTX/Alameda team weren’t paying attention to those particular liabilities; they got lost in the shuffle.

• At some point in the weeks before the collapse, when FTX was already under moderate financial strain, somebody noticed the $8B liability sitting around. And that took them from “moderate strain” to “implode”.

How this contrasts with what seems-to-me to be the “standard story”: most people seem to assume that it is just totally implausible to accidentally lose track of an $8B liability. Especially when the liability was already generated via the decidedly questionable practice of routing customer funds for the exchange through a hedge fund owned by the same people. And therefore it must have been intentional—in particular, most people seem to think the liability was intentionally hidden.

I think the main reason I disagree with others on this is that I’ve worked at a startup. About 5 startups, in fact, over the course of about 5 years.

The story where there was a quick-and-dirty solution (which was definitely sketchy but not ill-intentioned), and then stuff got lost in the shuffle, and then one day it turns out that there’s a giant unanticipated liability on the balance sheet… that’s exactly how things go, all the time. I personally was at a startup which had to undergo a firesale because the accounting overlooked something. And I’ve certainly done plenty of sketchy-but-not-ill-intentioned things at startups, as quick-and-dirty solutions. The story that SBF told about what happened sounds like exactly the sort of things I’ve seen happen at startups many times before.

• On my current understanding, this is true but more general; the natural abstraction hypothesis makes narrower predictions than that.

• 21 Nov 2022 22:37 UTC
2 points
0 ∶ 0

This is basically correct, other than the part about not having any guarantee that the information is in a nice format. The Maxent and Abstractions arguments do point toward a relatively nice format, though it’s not yet clear what the right way is to bind the variables of those arguments to stuff in a neural net. (Though I expect the data structures actually used will have additional structure to them on top of the maxent form.)

• 21 Nov 2022 22:22 UTC
2 points
0 ∶ 0

Meta: I’m going through a backlog of comments I never got around to answering. Sorry it took three months.

I’ve assumed it would be possible to reweight things to focus on a better distribution of data points, because it seems like there would be some very mathematically natural ways of doing this reweighting. Is this something you’ve experimented with?

Something along those lines might work; I didn’t spend much time on it before moving to a generative model.

When you say “directly applied”, what do you mean?

The actual main thing I did was to compute the SVD of the jacobian of a generative network output (i.e. the image) with respect to input (i.e. the latent vector). Results of interest:

• Conceptually, near-0 singular values indicate a direction-in-image-space in which no latent parameter change will move the image—i.e. locally-inaccessible directions. Conversely, large singular values indicate “degrees of freedom” in the image. Relevant result: if I take two different trained generative nets, and find latents for each such that they both output approximately the same image, then they both roughly agree on what directions-in-image-space are local degrees of freedom.

• By taking the SVD of the jacobian of a chunk of the image with respect to the latent, we can figure out which directions-in-latent-space that chunk of image is locally sensitive to. And then, a rough local version of the natural abstraction hypothesis would say that nonadjacent chunks of image should strongly depend on the same small number of directions-in-latent-space, and be “locally independent” (i.e. not highly sensitive to the same directions-in-latent-space) given those few. And that was basically correct.

To be clear, this was all “rough heuristic testing”, not really testing predictions carefully derived from the natural abstraction framework.
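A minimal sketch of that computation (the `generator` here is a stand-in for whatever trained generative net is being probed; names and shapes are illustrative, not the actual networks used):

```python
import torch

def jacobian_svd(generator, z, top_k=10):
    """SVD of the jacobian of the generator output (the image) w.r.t. the latent z.

    Large singular values mark local "degrees of freedom" in image space;
    near-zero singular values mark locally-inaccessible directions.
    """
    def flat_image(latent):
        return generator(latent).reshape(-1)  # flatten the image to a vector

    # Jacobian shape: (num_pixels, latent_dim)
    J = torch.autograd.functional.jacobian(flat_image, z.detach())
    U, S, Vh = torch.linalg.svd(J, full_matrices=False)
    # Columns of U: directions in image space; rows of Vh: directions in latent space.
    return U[:, :top_k], S[:top_k], Vh[:top_k]
```

For the per-chunk version, replace `flat_image` with a function which returns only the pixels of one chunk; the rows of `Vh` then give the directions-in-latent-space that chunk is locally sensitive to.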

• I think these flaws point towards that when we do interpretability, we more want to impose some structure on the network. That is, we want to find some set of conditions that could occur in reality, where we can know that if these conditions occur, the network satisfies some useful property (such as “usually classifies things correctly”).

The main difficulty with this is, it requires a really good understanding of reality?

There we go!

So, one item on my list of posts to maybe get around to writing at some point is about what’s missing from current work on interpretability, what bottlenecks would need to be addressed to get the kind of interpretability we ideally want for application to alignment, and how True Names in general and natural abstraction specifically fit into the picture.

The OP got about half the picture: current methods mostly don’t have a good ground truth. People use toy environments to work around that, but then we don’t know how well tools will generalize to real-world structures which are certainly more complex and might even be differently complex.

The other half of the picture is: what would a good ground truth for interpretability even look like? And as you say, the answer involves a really good understanding of reality.

Unpacking a bit more: “interpret” is a two-part word. We see a bunch of floating-point numbers in a net, and we interpret them as an inner optimizer, or we interpret them as a representation of a car, or we interpret them as fourier components of some signal, or …. Claim: the ground truth for an interpretability method is a True Name of whatever we’re interpreting the floating-point numbers as. The ground truth for an interpretability method which looks for inner optimizers is, roughly speaking, a True Name of inner optimization. The ground truth for an interpretability method which looks for representations of cars is, roughly speaking, a True Name of cars (which presumably routes through some version of natural abstraction). The reason we have good ground truths for interpretability in various toy problems is because we already know the True Names of all the key things involved in those toy problems—like e.g. modular addition and Fourier components.
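As a concrete illustration of that last point (my own toy sketch, not from the original comment): for modular addition we already know a ground-truth algorithm stated directly in terms of Fourier components, which is exactly what makes it usable as a ground truth for interpretability.

```python
import numpy as np

# Modular addition via Fourier components: summing cos(2*pi*k*(a + b - c)/p)
# over frequencies k peaks exactly at c = (a + b) mod p, because the cosines
# all align there and cancel everywhere else.
p = 113
a, b = 47, 92
c = np.arange(p)
k = np.arange(p)

logits = np.cos(2 * np.pi * np.outer(a + b - c, k) / p).sum(axis=1)
print(int(np.argmax(logits)), (a + b) % p)  # both print 26
```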

• The “most” was doing key work in that sentence you quoted.

I totally buy that antibiotic resistance is a large and growing problem. The part which seems like obvious bullshit is the claim that the cost outweighs the benefit, or is even remotely on the same order of magnitude, especially when we’re talking about an area like sub-Saharan Africa. Do any of those studies have a cost-benefit analysis?

(Also, side note: antibiotic resistance is totally in the news regularly. Here’s one from yesterday.)

• To be clear, I don’t think the claim that self-medicated antibiotic use causes more antibiotic resistance is obvious bullshit. Maybe the effect size is close to zero outside of hospitals, maybe it’s not, but the claim isn’t obvious bullshit either way.

The “obvious bullshit” part is the (implicit) claim that the cost outweighs the benefit, or is even remotely on the same order of magnitude, especially when we’re talking about an area where the alternative is usually “don’t use antibiotics at all”.

• Object level comment: Antibiotic resistance is bad, this is likely to make it worse, probably without saving lives. You probably shouldn’t self-medicate with antibiotics, you definitely shouldn’t give them to others without knowing more about medical diagnosis.

I’ve certainly heard arguments along those lines before. They seem like obvious bullshit. Evidence: in most of the world, antibiotics are readily available over-the-counter, and yet I don’t hear about most of the world’s human-infecting bacteria becoming antibiotic-resistant. Most of the world continues to use antibiotics as self-medication, and year after year they keep mostly working.

It seems to me like a very strong analogue to Oregon and New Jersey’s laws about pumping your own gas. Neither of those states allows it, which the rest of us know is completely stupid, but there’s still somehow a debate about it because lots of people make up reasons why it would be very dangerous to allow people to pump their own gas.