Co-founder @ Gladstone AI.
Contact: edouard@gladstone.ai
Website: eharr.is
Thanks very much for writing this. We appreciate all the feedback across the board, and I think this is a well done and in-depth write-up.
On the specific numerical thresholds in the report (i.e., your Key Proposal section), I do need to make one correction that also applies to most of Brooks’s commentary. All the numerical thresholds mentioned in the report, and particularly in that subsection, are solely examples and not actual recommendations. They are there only to show how one can calculate self-consistent licensing thresholds under the principles we recommend. They are not themselves recommendations. We had to do it this way for the same reason we propose granting fairly broad rule-setting flexibility to the regulatory entity. The field is changing so quickly that any concrete threshold risks being out of date (for one reason or another) in very short order. We would have liked to do otherwise, but that is not a realistic expectation for a report that we expect to be digested over the course of several months.
To avoid precisely this misunderstanding, the report states in several places that those very numbers are, in fact, only examples for illustration. A few screencaps of those disclaimers are below, but there are several others. Of course we could have included even more, but beyond a certain point one is simply adding more length to what you correctly point out is already quite a sizeable document. Note that the Time article, in the excerpt you quoted, does correctly note and acknowledge that the Tier 3 AIMD threshold is there as an example (emphasis added):
the report suggests, as an example, that the agency could set it just above the levels of computing power used to train current cutting-edge models like OpenAI’s GPT-4 and Google’s Gemini.
Apart from this, I do think overall you’ve done a good and accurate job of summarizing the document and offering sensible and welcome views, emphasis, and pushback. It’s certainly a long report, so this is a service to anyone who’s looking to go one or two levels deeper than the Executive Summary. We do appreciate you giving it a look and writing it up.
I think what gwern is trying to say is that continuous progress on a benchmark like PTB appears (from what we’ve seen so far) to map to discontinuous progress in qualitative capabilities, in a surprising way which nobody seems to have predicted in advance. Qualitative capabilities are more relevant to safety than benchmark performance is, because while qualitative capabilities include things like “code a simple video game” and “summarize movies with emojis”, they also include things like “break out of confinement and kill everyone”. It’s the latter capability, and not PTB performance, that you’d need to predict if you wanted to reliably stay out of the x-risk regime — and the fact that we can’t currently do so is, I imagine, what brought to mind the analogy between scaling and Russian roulette.
I.e., a straight line in domain X is indeed not surprising; what’s surprising is the way in which that straight line maps to the things we care about more than X.
(Usual caveats apply here that I may be misinterpreting folks, but that is my best read of the argument.)
It’s simply because we each (myself more than her) have an inclination to apply a fair amount of adjustment in a conservative direction, for generic “burden of proof” reasons, rather than go with the timelines that seem most reasonable based on the report in a vacuum.
While one can sympathize with the view that the burden of proof ought to lie with advocates of shorter timelines when it comes to the pure inference problem (“When will AGI occur?”), it’s worth observing that in the decision problem (“What should we do about it?”) this situation is reversed. The burden of proof in the decision problem probably ought instead to lie with advocates of non-action: when one’s timelines are >1 generation, it is a bit too easy to kick the can down the road in various ways — leaving one unprepared if the future turns out to move faster than we expected. Conversely someone whose timelines are relatively short may take actions today that will leave us in a better position in the future, even if that future arrives more slowly than they believed originally.
(I don’t think OpenPhil is confusing these two, just that in a conversation like this it is particularly worth emphasizing the difference.)
Thanks, that helps. So actually this objection says: “No, the biggest risk lies not in the trustworthiness of the Bob you use as the input to your scheme, but rather in the fidelity of your copying process; and this is true even if the errors in your copying process are being introduced randomly rather than adversarially. Moreover, if you actually do develop the technical capability to reduce your random copying-error risk down to around the level of your Bob-trustworthiness risk, well guess what, you’ve built yourself an AGI. But since this myopic copying scheme thing seems way harder than the easiest way I can think of to build an AGI, that means a fortiori that somebody else built one the easy way several years before you built yours.”
Is that an accurate interpretation?
Thanks for writing this.
I have one point of confusion about some of the notation that’s being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.
First, $\phi$ is being defined here as an outcome permutation. Presumably this means that 1) $\phi(o_i) = o_j$ for some pair of outcomes $o_i, o_j$; and 2) $\phi$ admits a unique inverse $\phi^{-1}$. That makes sense.
We also define lotteries over outcomes, presumably as, e.g., $L = \sum_i p_i \, o_i$, where $p_i$ is the probability of outcome $o_i$. Of course we can interpret the $o_i$ geometrically as mutually orthogonal unit vectors, so this lottery defines a point on the probability simplex. So far, so good.
But the thing that’s confusing me is what this implies for the definition of $\phi(L)$. Because $\phi$ is defined as a permutation over outcomes (and not over probabilities of outcomes), we should expect this to be $\phi(L) = \sum_i p_i \, \phi(o_i)$.
The problem is that this seems to give a different EV from the lemma:
$$\mathbb{E}_{\phi(L)}[u] = \sum_j p_j \, u(\phi(o_j))$$
(Note that I’m using $j$ as the dummy variable rather than $i$, but the LHS above should correspond to line 2 of the proof.) Doing the same thing for the lottery $L'$ gives an analogous result. And then looking at the inequality that results suggests that lemma 3 should actually be “$\phi^{-1}$ induces the stated inequality” as opposed to “$\phi$ induces the stated inequality”.
(As a concrete example, suppose we have a lottery $L = p_1 o_1 + p_2 o_2 + p_3 o_3$ with the permutation $\phi(o_1) = o_2$, $\phi(o_2) = o_3$, $\phi(o_3) = o_1$. Then $\phi(L) = p_1 o_2 + p_2 o_3 + p_3 o_1$ and our EV is
$$\mathbb{E}_{\phi(L)}[u] = p_1 u(o_2) + p_2 u(o_3) + p_3 u(o_1).$$
Yet the expression in the lemma, which permutes the probabilities rather than the outcomes, gives $p_2 u(o_1) + p_3 u(o_2) + p_1 u(o_3)$ instead, which appears to contradict the lemma as stated.)
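If it helps, here is a quick numeric check of the same point. The probabilities, utilities, and the particular 3-cycle below are arbitrary values I picked for illustration, not anything taken from the proof:

```python
# Numeric check of the permuted-lottery EV. All values below are arbitrary
# illustrative choices, not taken from the paper.
import numpy as np

p = np.array([0.5, 0.3, 0.2])        # lottery L: probabilities of o_0, o_1, o_2
u = np.array([1.0, 2.0, 4.0])        # an arbitrary utility function over outcomes
sigma = np.array([1, 2, 0])          # phi(o_i) = o_sigma[i], i.e. a 3-cycle
sigma_inv = np.argsort(sigma)        # the inverse permutation phi^{-1}

# Reading 1: phi permutes outcomes, so phi(L) = sum_i p_i * phi(o_i).
ev_outcomes_permuted = np.dot(p, u[sigma])       # sum_i p_i * u(phi(o_i))  -> 2.4

# Same number, written as a distribution over outcomes: P_{phi(L)}(o_j) = p_{phi^-1(j)}.
assert np.isclose(ev_outcomes_permuted, np.dot(p[sigma_inv], u))

# Reading 2: permute the probabilities "the same way" instead, sum_j p_{phi(j)} * u(o_j).
ev_probs_permuted = np.dot(p[sigma], u)          # -> 2.7

print(ev_outcomes_permuted, ev_probs_permuted)   # the two readings disagree
```

The gap between the two readings is exactly the $\phi$ vs. $\phi^{-1}$ swap described above.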
Note that even if this analysis is correct, it doesn’t invalidate your main claim. You only really care about the existence of a bijection rather than what that bijection is — the fact that your outcome space is finite ensures that the proportion of orbit elements that incentivize power seeking remains the same either way. (It could have implications if you try to extend this to a metric space, though.)
Again, it’s also possible I’ve just misunderstood something here — please let me know if that’s the case!
Nice. Congrats on the launch! This is an extremely necessary line of effort.
Hey Daniel — thanks so much for taking the time to write this thoughtful feedback. I really appreciate you doing this, and very much enjoyed your “2026” post as well. I apologize for the delay and lengthy comment here, but wanted to make sure I addressed all your great points.
1. It would be great if you could pepper your story with dates, so that we can construct a timeline and judge for ourselves whether we think things are happening too quickly or not.
I’ve intentionally avoided referring to absolute dates, other than by indirect implication (e.g. “iOS 19”). In writing this, I was more interested in exploring how a plausible technical development model might interact with the cultural and economic contexts of these companies. As a result I decided to focus on a chain of events instead of a timeline.
But another reason is that I don’t feel I know enough to have a strong view on dates. I do suspect we have been in an overhang of sorts for the past year or so, and that the key constraints on broad-based development of scaled models up to this point have been institutional frictions. It takes a long time to round up the internal buy-in you need for an investment at this scale, even in an org that has a technical culture, and even if you have a committed internal champion. And that means the pace of development immediately post-GPT3 is unusually dependent on random factors like the whims of decision-makers, and therefore has been/will be especially hard to predict.
(E.g., how big will Google Pathways be, in terms of scale/compute? How much capex committed? Nobody knows yet, as far as I can tell. As a wild guess, Jeff Dean could probably get a $1B allocation for this if he wanted to. Does he want $1B? Does he want $10B? Could he get $10B if he really pushed for it? Does the exec team “get it” yet? When you’re thinking in terms of ROI for something like this, a wide range of outcomes is on the table.)
2. Auto-generated articles and auto-generated videos being so popular that they crowd out most human content creators… this happens at the beginning of the story? I think already this is somewhat implausible and also very interesting and deserves elaboration. Like, how are you imagining it: we take a pre-trained language model, fine-tune it on our article style, and then let it loose using RL from human feedback (clicks, ad revenue) to learn online? And it just works? I guess I don’t have any arguments yet for why that shouldn’t work, but it seems intuitively to me that this would only work once we are getting pretty close to HLAGI / APS-AI. How big are these models in your story? Presumably bigger than GPT-3, right, since even a fine-tuned GPT-3 wouldn’t be able to outperform human content creators (right?). And currently video generation tech lags behind text generation tech.
The beginning of the story still lies in our future, so to be clear, this isn’t a development I’d necessarily expect immediately. I am definitely imagining an LM bigger than GPT-3, but it doesn’t seem at all implausible that ByteDance would build such an LM on, say, a 24-month timeframe from today. They certainly have the capital for it, and the company has a history of favoring algorithmic recommendations and AI over user-driven virality — so particularly in Toutiao’s case, this would be a natural extension of their existing content strategy. And apart from pure scale, the major technical hurdle for auto-generated articles seems like it’s probably the size of the attention window, which people have been making notable progress on recently.
I’d say the “it just works” characterization is not quite right — I explicitly say that this system takes some time to fine tune even after it’s first deployed in production. To elaborate a bit, I wouldn’t expect any training based on human feedback at first, but rather something more like manual screening/editing of auto-generated articles by internal content teams. That last part is not something I said explicitly in the text; maybe I should?
I think your point about video is a great critique though. It’s true that video has lagged behind text. My thinking here was that the Douyin/TikTok form factor is an especially viable setting to build early video gen models: the videos are short, and they already have a reliable reward model available in the form of the existing rec algorithm. But even though this might be the world’s best corpus to train on, I do agree with you that there is more fundamental uncertainty around video models. I’d be interested in any further thoughts you might have on this point.
One question on this part: what do you mean by “APS-AI”?
3. “Not long after, Google rocks the tech industry with a major announcement at I/O. They’ve succeeded in training a deep learning model to completely auto-generate simple SaaS software from a natural-language description.” Is this just like Codex but better? Maybe I don’t know what SaaS software is.
Yes, pretty much just Codex but better. One quick-and-dirty way to think of SaaS use cases is: “any business workflow that touches a spreadsheet”. There are many, many, many such use cases.
4. “At first, the public is astonished. But after nothing more is heard about this breakthrough for several months, most eventually dismiss it as a publicity stunt. But one year later, Google launches an improved version of the model in a new Search widget called “synthetic SaaS”.”—I didn’t successfully read between the lines here, what happened in that quiet year?
Ah this wasn’t meant to be subtle or anything, just that it takes time to go from “prototype demo” to “Google-scale production rollout”. Sorry if that wasn’t clear.
5. “The S&P 500 doubles that year, driven by explosive growth in the big-cap tech stocks. Unemployment claims reach levels not seen since the beginning of the Covid crisis.” Why is unemployment so high? So far it seems like basic programming jobs have been automated away, and lots of writing and video generation jobs. But how many jobs are those? Is it enough to increase unemployment by a few percent? I did some googling and it seems like there are between 0.5 and 1 million jobs in the USA that are like this, though I’m not at all confident. (there are 0.25M programmer jobs) More than a hundred million total employed, though. So to make unemployment go up by a couple percent a bunch of other stuff would need to be automated away besides the stuff you’ve mentioned, right?
You’re absolutely right. I was imagining some additional things happening here which I didn’t put into the story and therefore didn’t think through in enough detail. I’d expect unemployment to increase, but not necessarily to this extent or on these timescales. Will delete this sentence — thanks!
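(For what it’s worth, here’s a rough version of your back-of-envelope check. The job figures are the ones from your comment; the 155M total-employment number is my own rounded assumption:)

```python
# Back-of-envelope scale check using the job figures from the comment above.
# The 155M total US employment number is my own rounded assumption.
content_and_video_jobs = 1_000_000     # upper end of the 0.5M-1M estimate
programmer_jobs = 250_000
total_employed = 155_000_000

share = (content_and_video_jobs + programmer_jobs) / total_employed
print(f"{share:.2%}")                  # ~0.81%: well short of a multi-point unemployment rise
```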
6. “At the end of that year, the stock market once again delivers astronomical gains. Yet, curiously, the publicly disclosed performance of hedge funds — particularly of the market-neutral funds that trade most frequently — consists almost entirely of losses.” I take it this is because several tech companies are secretly using AI to trade? Is that legal? How would they be able to keep this secret?
Good question. I don’t actually expect that any tech companies would do this. While it could strictly speaking be done in a legal way, I can’t imagine the returns would justify the regulatory and business-relationship risk. More to the point, big tech cos already own money machines that work, and that have even better returns on capital than market trading from an unleveraged balance sheet would.
My implication here is rather that other hedge funds enter the market and begin trading using sophisticated AIs. Hedge funds aren’t required to disclose their returns publicly, so I’m imagining that one or more of these funds have entered the market without disclosure.
7. You have a section on autonomous drones. Why is it relevant? Is the implication that they are going to be used by the AI to take over? The last section makes it seem like the AI would have succeeded in taking over anyway, drones or no. Ditto for the USA’s self-improving cyberwar software.
Great observation. I was debating whether to cut this part, actually. I kept it because 1) it motivated the plot later, when OpenAI debates whether to build in an explicit self-improvement mechanism; and 2) it felt like I should tell some kind of story about military applications. But given how I’m actually thinking about self-improvement and the risk model (see 9 and 12, below) I think this can be cut with little loss.
8. “Codex 4 is expected to cost nearly a billion dollars in compute alone.” This suggests that all the AIs made so far cost less than that? Which means it’s, like, not even 2025 yet according to Ajeya’s projection?
Oh yeah, you’re totally right and this is a major error on my part. This should be more like $10B+. Will edit!
9. “After a rigorous internal debate, it’s also decided to give Codex 4 the ability to suggest changes to its own codebase during training, in an attempt to maximize performance via architectural improvements in the model.” I thought part of the story here was that more complex architectures do worse? Are you imagining that Codex 4 discovers simpler architectures? By the way, I don’t think that’s a plausible part of the story—I think even if the scaling hypothesis and bitter lesson are true, it’s still the case that more complex, fiddly architectures help. It’s just that they don’t help much compared to scaling up compute.
I agree that the bitter lesson is not as straightforward as “complex architectures do worse”, and I also agree with you that fiddly architectures can do better than simple ones. But I don’t really believe the kinds of fiddly architectures humans will design are likely to perform better than our simplest architectures at scale. Roughly speaking, I do not believe we are smart enough to approach this sort of work with the right assumptions to design good architectures, and under those conditions, the fewer assumptions we embed in our architectures, the better.
I do believe that the systems we build will be better at designing such architectures than we are, though. And that means there is indeed something to be gained from fiddly architectures — just not from “human-fiddly” ones. In fact, you can argue that this is what meta-learning does: a system that meta-learns is one that redesigns its own architecture, in some sense. And actually, articulating it that way suggests that this kind of self-improvement is really just the limit case of meta-learning — which in turn makes the explicit self-improvement scheme in my story redundant! So yep, I think this gets cut too. :)
10. “This slows down the work to a crawl and multiplies the expense by an order of magnitude, but safety is absolutely paramount.” Why is Microsoft willing to pay these costs? They don’t seem particularly concerned about AI risk now, are you imagining this changes in the next 4 years? How does it change? Is it because people are impressed by all the AI progress and start to listen to AI safety people?
There is no “canon” reason why they are doing this — I’m taking some liberties in this direction because I don’t expect the kinds of safety precautions they are taking to matter much. However I do expect that alignment will soon become an obvious limiting factor in getting big models to do what we want, and it doesn’t seem too unreasonable to expect this might be absorbed as a more general lesson.
11. Also, if it’s slowing the work to a crawl and multiplying the expense, shouldn’t Microsoft/OpenAI be beaten to the punch by some other company that isn’t bothering with those precautions? Or is the “market” extremely inefficient, so to speak?
The story as written is intentionally consistent with OpenAI being beaten to the punch by a less cautious company. In fact, I consider that the more plausible failure scenario (see next point) even though the text strongly implies otherwise.
Still, it’s marginally plausible that nobody was yet willing to commit funds on that scale at the time of the project — and in the world of this story, that’s indeed what happened. Relatively few organizations have the means for something like this, so that does make the market less efficient than it would be if it had more viable participants.
12. “Not long after this, the world ends.” Aaaaagh tell me more! What exactly went wrong? Why did the safety techniques fail? (To be clear, I totally expect that the techniques you describe would fail. But I’m interested to hear your version of the story.)
Yeah, I left this deliberately ambiguous. The reason is that I’m working from a risk model that I’m a bit reluctant to publicize too widely, since it feels like there is some chance that the publication itself might be slightly risky. (I have shared it privately with a couple of folks though, and would be happy to follow up with you on this by DM — please let me know if you’re interested.) As a result, while I didn’t want to write a story that was directly inconsistent with my real risk model, I did end up writing a story that strongly implies an endgame scenario which I don’t actually believe is very likely (i.e., “OpenAI carefully tries to train an aligned AI but it blows up”).
Honestly I wasn’t 100% sure how to work around this problem — hence the ambiguity and the frankly kludgy feel of the OpenAI bit at the end. But I figured the story itself was worth posting at least for its early development model (predicated on a radical version of connectionism) and economic deployment scenario (predicated on earliest rollouts in environments with fastest feedback cycles). I’d be especially interested in your thoughts on how to handle this, actually.
13. Who is Jessica? Is she someone important? If she’s not important, then it wouldn’t be worth a millisecond delay to increase success probability for killing her.
Jessica is an average person. The AI didn’t delay anything to kill her; it doesn’t care about her. Rather I’m intending to imply that whatever safety precautions were in place to keep the AI from breaking out merely had the effect of causing a very small time delay.
14. It sounds like you are imagining some sort of intelligence explosion happening in between the Codex 4 section and the Jessica section. Is this right or a misinterpretation?
Yes that is basically right.
Thanks again Daniel!
UPDATE: Made several changes to the post based on this feedback.
which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the “same agent” in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment
I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agent that is executing some policy $\pi(a \mid s)$; and on the other side we have an environment that consists of state transition dynamics given by some distribution $p(s' \mid s, a)$. One can in fact show (see the unidentifiability in IRL paper) that if an experimenter has the power to vary the environment arbitrarily and look at the policies the agent pursues on each of those environments, then that experimenter can recover a reward function that is unique up to the usual affine transformations.
That recovered reward function is a fortiori a reliable invariant of the agent, since it is consistent with the agent’s actions under every possible environment the agent could be exposed to. (To be clear, this claim is also proved in the paper.) It also seems reasonable to identify that reward function with the mesa-objective of the agent, because any mesa-objective that is not identical with that reward function has to be inconsistent with the agent’s actions on at least one environment.
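Here’s a minimal sketch of the kind of carve-up and environment-variation experiment I have in mind. The class names and the toy two-state environments are mine, purely for illustration, not anything from the paper:

```python
# Minimal sketch of the agent/environment carve-up in the tabular RL setting.
# The class names and the two toy environments below are illustrative
# assumptions on my part, not anything from the paper under discussion.
import numpy as np

class Agent:
    """The 'agent' side of the carve-up: nothing but a policy pi(a | s)."""
    def __init__(self, policy):                 # policy: shape (n_states, n_actions)
        self.policy = policy

    def act(self, state, rng):
        return rng.choice(self.policy.shape[1], p=self.policy[state])

class Environment:
    """The 'environment' side: transition dynamics p(s' | s, a) and a current state."""
    def __init__(self, transitions, start_state=0):   # transitions: (n_states, n_actions, n_states)
        self.T = transitions
        self.state = start_state

    def step(self, action, rng):
        self.state = rng.choice(self.T.shape[2], p=self.T[self.state, action])
        return self.state

def rollout(agent, env, horizon, seed=0):
    """Drop the *same* agent into an arbitrary environment and record what it does."""
    rng = np.random.default_rng(seed)
    visited = [env.state]
    for _ in range(horizon):
        visited.append(env.step(agent.act(env.state, rng), rng))
    return visited

# One fixed agent, two different two-state environments.
agent = Agent(np.array([[0.9, 0.1], [0.1, 0.9]]))
env_a = Environment(np.array([[[1., 0.], [0., 1.]],
                              [[1., 0.], [0., 1.]]]))     # actions move you deterministically
env_b = Environment(np.array([[[0., 1.], [1., 0.]],
                              [[0., 1.], [1., 0.]]]))     # same actions, dynamics flipped

print(rollout(agent, env_a, 5), rollout(agent, env_b, 5))
```

The identifiability result applies to an experimenter who can swap in arbitrary `Environment` instances like these and observe the policy the agent ends up pursuing in each one.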
Admittedly there are some technical caveats to this particular result: off the top, 1) the set of states & actions is fixed across environments; 2) the result was proved only for finite sets of states & actions; and 3) optimal policy is assumed. I could definitely imagine taking issue with some of these caveats — is this the sort of thing you mean? Or perhaps you’re skeptical that a proof like this in the RL setting could generalize to the train/test framing we generally use for NNs?
in the OOD robustness literature you try to optimize worst-case performance over a perturbation set of possible environments.
Yeah that’s sensible because this is often all you can do in practice. Having an omnipotent experimenter is rarely realistic, but imo it’s still useful as a way to bootstrap a definition of the mesa-objective.
Btw, if you’re aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can’t carve up a world, or recover a consistent utility function through this sort of process — please let me know. I’m directly working on a generalization of this problem at the moment, and anything like that could significantly accelerate my execution.
Thanks!!
Loved this post. This whole idea of using a deterministic dynamical system as a conceptual testing ground feels very promising.
A few questions / comments:
About the examples: do you think it’s strictly correct to say that entropy / death is an optimizing system? One of the conditions of the Flint definition is that the set of target states ought to be substantially smaller than the basin of attraction, by some measure on the configuration space. Yet neither high entropy nor death seem like they satisfy this: there are too many ways to be dead, and (tautologically) too many ways to have high entropy. As a result, both the “dead” property and the “high-entropy” property make up a large proportion of the attraction basin. The original post makes a similar point, though admittedly there is some degree of flexibility in terms of how big the target state set has to be before you call the system an optimizer.
Not sure if this is a useful question, but what do you think of using “macrostate” as opposed to “property” to mean a set of states? This term “macrostate” is used in statistical physics for the identical concept, and as you’re probably aware, there may be results from that field you’d be able to leverage here. (The “size” of a macrostate is usually thought of as its entropy over states, and this seems like it could fit into your framework as well. At first glance it doesn’t seem too unreasonable to just use a flat prior over grid configurations, so this just ends up being the log of the state count.)
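Concretely, here’s roughly what I mean under the flat-prior assumption. The tiny 3x3 grid and the example property are arbitrary choices of mine:

```python
# Sketch: the "size" of a macrostate (a set of grid configurations) as its entropy
# under a flat prior, which reduces to log2 of the number of configurations in it.
# The 3x3 grid and the example property are arbitrary choices for illustration.
from itertools import product
from math import log2

CELLS = 9                                           # a tiny 3x3 Life board

def macrostate_size(prop):
    """log2(# configurations satisfying prop), i.e. entropy under a flat prior."""
    count = sum(1 for cfg in product((0, 1), repeat=CELLS) if prop(cfg))
    return log2(count)

at_most_one_alive = lambda cfg: sum(cfg) <= 1       # example "target" property
print(macrostate_size(at_most_one_alive))           # ~3.32 bits, vs 9 bits for the full space
```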
I like the way embedded perturbations have been defined too. External perturbations don’t seem fundamentally different from embedded ones (we can always just expand our configuration space until it includes the experimenter) but keeping perturbations “in-game” cuts out those complications while keeping the core problem in focus.
The way you’ve set things up to smoothly vary the “degree” of optimization of a system is very elegant.
Do you imagine keeping the mask constant over the course of a computational rollout? Plausibly, as a computation proceeds, some kinds of agents may start to decohere as they move outside the original mask area and/or touch and merge with bits of their environments. E.g., if the agent is a glider, does the mask “follow” the agent? Or are you for now mostly considering patterns like eaters that stay in one place?
Thanks so much for the feedback!
The ability to sort by model size etc would be nice. Currently sorting is alphabetical.
Right now the default sort is actually chronological by publication date. I just added the ability to sort by model size and compute budget at your suggestion. You can use the “⇅ Sort” button in the Models tab to try it out; the rows should now sort correctly.
Also the rows with long textual information should be more to the right and the more informative/tighter/numerical columns more to the left (like “deep learning” in almost all rows, not very informative). Ideally the most relevant information would be on the initial page without scrolling.
You are absolutely right! I’ve just taken a shot at rearranging the columns to surface the most relevant parts up front and played around a bit with the sizing. Let me know what you think.
“Date published” and “date trained” can be quite different. Maybe worth including the latter?
That’s true, though I’ve found the date at which a model was trained usually isn’t disclosed as part of a publication (unlike parameter count and, to a lesser extent, compute cost). There is also generally an incentive to publish fairly soon after the model’s been trained and characterized, so you can often rely on the model not being that stale, though that isn’t universal.
Is there a particular reason you’d be interested in seeing training dates as opposed to (or in addition to) publication dates?
Thanks again!
Yeah, these are interesting points.
Isn’t it a bit suspicious that the thing-that’s-discontinuous is hard to measure, but the-thing-that’s-continuous isn’t? I mean, this isn’t totally suspicious, because subjective experiences are often hard to pin down and explain using numbers and statistics. I can understand that, but the suspicion is still there.
I sympathize with this view, and I agree there is some element of truth to it that may point to a fundamental gap in our understanding (or at least in mine). But I’m not sure I entirely agree that discontinuous capabilities are necessarily hard to measure: for example, there are benchmarks available for things like arithmetic, which one can train on and make quantitative statements about.
I think the key to the discontinuity question is rather that 1) it’s the jumps in model scaling that are happening in discrete increments; and 2) everything is S-curves, and a discontinuity always has a linear regime if you zoom in enough. Those two things together mean that, while a capability like arithmetic might have a continuous performance regime on some domain, in reality you can find yourself halfway up the performance curve in a single scaling jump (and this is in fact what happened with arithmetic and GPT-3). So the risk, as I understand it, is that you end up surprisingly far up the scale of “world-ending” capability from one generation to the next, with no detectable warning shot beforehand.
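To make point 2) a bit more concrete, here’s a toy sketch; the logistic curve and the ~100x-per-generation jumps are made-up numbers, just to show the shape of the effect:

```python
# Toy illustration: a capability that follows a smooth S-curve in (log) scale can
# still look discontinuous if scale only moves in large discrete jumps.
# The curve parameters and the 100x-per-generation jumps are invented for illustration.
import numpy as np

def capability(log10_compute, midpoint=24.0, steepness=2.0):
    """Smooth logistic performance curve as a function of log10(training compute)."""
    return 1.0 / (1.0 + np.exp(-steepness * (log10_compute - midpoint)))

generations = np.array([22.0, 24.0, 26.0])   # e.g. ~100x more compute per generation
for g in generations:
    print(f"log10(compute)={g}: performance={capability(g):.3f}")
# Prints roughly 0.018 -> 0.500 -> 0.982: most of the curve is crossed in a single jump.
```

Sampled only at those discrete jumps, the perfectly smooth curve looks like a discontinuity.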
“No one predicted X in advance” is only damning to a theory if people who believed that theory were making predictions about it at all. If people who generally align with Paul Christiano were indeed making predictions to the effect of GPT-3 capabilities being impossible or very unlikely within a narrow future time window, then I agree that would be damning to Paul’s worldview. But—and maybe I missed something—I didn’t see that. Did you?
No, you’re right as far as I know; at least I’m not aware of any such attempted predictions. And in fact, the very absence of such prediction attempts is interesting in itself. One would imagine that correctly predicting the capabilities of an AI from its scale ought to be a phenomenally valuable skill — not just from a safety standpoint, but from an economic one too. So why, indeed, didn’t we see people make such predictions, or at least try to?
There could be several reasons. For example, perhaps Paul (and other folks who subscribe to the “continuum” world-model) could have done it, but they were unaware of the enormous value of their predictive abilities. That seems implausible, so let’s assume they knew the value of such predictions would be huge. But if you know the value of doing something is huge, why aren’t you doing it? Well, if you’re rational, there’s only one reason: you aren’t doing it because it’s too hard, or otherwise too expensive compared to your alternatives. So we are forced to conclude that this world-model — by its own implied self-assessment — has, so far, proved inadequate to generate predictions about the kinds of capabilities we really care about.
(Note: you could make the argument that OpenAI did make such a prediction, in the approximate yet very strong sense that they bet big on a meaningful increase in aggregate capabilities from scale, and won. You could also make the argument that Paul, having been at OpenAI during the critical period, deserves some credit for that decision. I’m not aware of Paul ever making this argument, but if made, it would be a point in favor of such a view and against my argument above.)
This is a great thread. Let me see if I can restate the arguments here in different language:
Suppose Bob is a smart guy whom we trust to want all the best things for humanity. Suppose we also have the technology to copy Bob’s brain into software and run it in simulation at, say, a million times its normal speed. Then, if we thought we had one year between now and AGI (leaving aside the fact that I just described a literal AGI in the previous sentence), we could tell simulation-Bob, “You have a million subjective years to think of an effective pivotal act in the real world, and tell us how to execute it.” Bob’s a smart guy, and we trust him to do the right thing by us; he should be able to figure something out in a million years, right?
My understanding of Evan’s argument at this point would be: “Okay; so we don’t have the technology to directly simulate Bob’s brain. But maybe instead we can imitate its I/O signature by training a model against its actions. Then, because that model is software, we can (say) speed it up a million times and deal with it as if it was a high-fidelity copy of Bob’s brain, and it can solve alignment / execute pivotal action / etc. for us. Since Bob was smart, the model of Bob will be smart. And since Bob was trustworthy, the model of Bob will be trustworthy to the extent that the training process we use doesn’t itself introduce novel long-term dependencies that leave room for deception.”
Note that myopia — i.e., the purging of long-term dependencies from the training feedback signal — isn’t really conceptually central to the above scheme. Rather, it is just a hack intended to prevent additional deception risks from being introduced through the act of copying Bob’s brain. The simulated / imitated copy of Bob is still a full-blown consequentialist, with all the manifold risks that entails. So the scheme is basically a way of taking an impractically weak system that you trust, and overclocking it without otherwise affecting it, so that it retains (you hope) the properties that made you trust it in the first place.
At this point my understanding of Eliezer’s counterargument would be: “Okay sure; but find me a Bob that you trust enough to actually put through this process. Everything else is neat, but it is downstream of that.” And I think that this is correct and that it is a very, very strong objection, but — under certain sets of assumptions about timelines, alternatives, and counterfactual risks — it may not be a complete knock-down. (This is the “belling the cat” bit, I believe.)
And at this point, maybe (?) Evan says, “But wait; the Bob-copy isn’t actually a consequentialist because it was trained myopically.” And if that’s what Evan says, then I believe this is the point at which there is an empirically resolvable disagreement.
Is this roughly right? Or have I missed something?