# Zac Hatfield-Dodds

Karma: 610

Researcher at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev

• “maximize the expected amount of money in this bank account until 1 day from now, then maximize the probability of shutting yourself off”… might cause a lot of damage, but it probably wouldn’t end the world in pursuit of these objectives.

This is not actually a limited-horizon agent; you’ve just set a time at which it changes objectives. And wouldn’t ending the world be the most reliable way to ensure pesky humans never turn you back on?...

(unfortunately thinking about constraints you can place on an unaligned agent never leads anywhere useful; alignment is the only workable solution in the long term)

• Probably since prehistory and certainly since antiquity we’ve had some ‘mesa’/‘runtime’ understanding of heritability, in contrast to (presumably) all other animals.

No, not so much. See e.g. https://www.gwern.net/reviews/Bakewell

> Like anything else, the idea of “breeding” had to be invented. That traits are genetically-influenced broadly equally by both parents subject to considerable randomness and can be selected for over many generations to create large average population-wide increases had to be discovered the hard way, with many wildly wrong theories discarded along the way. Animal breeding is a case in point, as reviewed by an intellectual history of animal breeding, Like Engend’ring Like, which covers mistaken theories of conception & inheritance from the ancient Greeks to perhaps the first truly successful modern animal breeder, Robert Bakewell (1725–1795).

> Why did it take thousands of years to begin developing useful animal breeding techniques, a topic of interest to almost all farmers everywhere, a field which has no prerequisites such as advanced mathematics or special chemicals or mechanical tools, and seemingly requires only close observation and patience? … What is most interesting is the intellectual history we can extract from it in terms of inventing heritability and as important, one of the inventions of progress in the gradual realization that selective breeding was even possible.

• I enjoyed the whole paper! It’s just that “read sections 1 through 8” doesn’t reduce the length much, and 5-6 have some nice short results that can be read alone :-)

• The paper is really only 28 pages plus lots of graphs in the appendices! If you want to skim, I’d suggest just reading the abstract and then sections 5 and 6 (pp. 16–21). But to summarize:

• Do neural networks learn the same concepts as humans, or at least human-legible concepts? A “yes” would be good news for interpretability (and alignment). Let’s investigate AlphaZero and Chess as a case study!

• Yes, over the course of training AlphaZero learns many concepts (and develops behaviours) which have clear correspondence with human concepts.

• Low-level / ground-up interpretability seems very useful here. Learned summaries are also great for chess but rely on a strong ground-truth (e.g. “Stockfish internals”).

• Details about where in the network and when in the training process things are represented and learned.

The analysis of differences between the timing and order of developments in human scholarship and AlphaZero training is pretty cool if you play chess; e.g. human experts have diversified their openings (not just 1.e4) since 1700, while AlphaZero narrows down from random play to pretty much the modern distribution over GM openings; and AlphaZero tends to learn material values before positional concepts and standard openings.
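If you want the flavour of the method: most of the concept results come from linear probes on activations. Here’s a minimal sketch of that idea, with synthetic data standing in for real AlphaZero activations and human concept labels (not the paper’s actual setup):

```python
import numpy as np

# Hypothetical data: activations from one layer on N positions, plus a
# human-defined concept score per position (e.g. material balance).
rng = np.random.default_rng(0)
N, D = 1000, 256
activations = rng.standard_normal((N, D))
true_direction = rng.standard_normal(D)
concept = activations @ true_direction + 0.1 * rng.standard_normal(N)

# Fit a linear probe on a train split; high held-out R^2 suggests the
# concept is linearly decodable from this layer at this training step.
train, test = slice(0, 800), slice(800, None)
w, *_ = np.linalg.lstsq(activations[train], concept[train], rcond=None)
pred = activations[test] @ w
r2 = 1 - np.var(concept[test] - pred) / np.var(concept[test])
print(f"held-out R^2 of the probe: {r2:.3f}")
```

Repeating this per layer and per training checkpoint is how you get the “where in the network and when in training” picture.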

• On a quick skim it looks like that fails both “not equivalent to executing the model” and the float32-vs-ℝ problem.

It’s a nice approach, but I’d also be surprised if it scales to maintain tight bounds on much larger networks.
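To gesture at why I expect scaling to be hard: sound bound-propagation methods keep valid per-layer boxes, but the boxes typically widen with depth. A generic interval-bound-propagation sketch (not necessarily the proposal under discussion):

```python
import numpy as np

rng = np.random.default_rng(0)

def ibp_affine(lo, hi, W, b):
    """Soundly propagate the box [lo, hi] through x @ W + b."""
    centre, radius = (hi + lo) / 2, (hi - lo) / 2
    c = centre @ W + b
    r = radius @ np.abs(W)  # worst case over all x in the box
    return c - r, c + r

# A toy deep ReLU net with random weights: the box width grows each layer.
lo = np.zeros(64, dtype=np.float32)
hi = lo + np.float32(1e-3)  # inputs perturbed by at most 1e-3 per coordinate
for depth in range(8):
    W = rng.standard_normal((64, 64), dtype=np.float32) / 8
    lo, hi = ibp_affine(lo, hi, W, np.float32(0.0))
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)  # ReLU is monotone
    print(depth, float((hi - lo).max()))
```

Each matmul multiplies the interval radius by roughly the row-sums of |W|, which is why the printed widths grow geometrically; tighter relaxations exist, but to my knowledge none maintain tight global bounds at anything like modern model sizes.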

• Ah, crux: I do think the floating-point matters! Issues of precision, underflow, overflow, and NaNs bedevil model training and occasionally deployment-time behavior. By analogy, if we deploy an AGI whose ideal mathematical form is aligned we may still be doomed, even if it’s plausibly our best option in expectation.
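For concreteness, a few of the float32 pathologies I have in mind, none of which exist over the reals:

```python
import numpy as np

x = np.float32(1e30)
print(x * x)                     # inf: overflows float32's ~3.4e38 maximum
print(x * x - x * x)             # nan: inf - inf is undefined

a, b = np.float32(1e8), np.float32(1.0)
print((a + b) - a)               # 0.0: b is lost to rounding, so addition isn't associative
print(np.exp(np.float32(89.0)))  # inf: unnormalised softmax logits this large overflow
```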

“Checkable” meaning that I or someone I trust with this has to be able to check it! Maxwell’s proposal is simple enough that I can reason through the whole thing, even over float32 rather than ℝ, but for more complex arguments I’d probably want it machine-checkable for at least the tricky numeric parts.

• First, an apology: I didn’t mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I’m sorry I wasn’t clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.

Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art and even verified systems have (fewer) bugs.

Being able to state theorems about AGI seems absolutely necessary for success; but I don’t think it’s close to sufficient.

• I note that this seems pretty strongly disanalogous to any kind of safety proof we care about, and doesn’t generalise to any reasoning about nonzero sensitivity. That said, your assumptions seem fair to me, and I’d be happy to pay out the prize if this actually works.

• I have an optional internal monologue, and programming or playing strategy games is usually a non-verbal exercise.

I’m sure you could in principle (though not as described!) map neuron firings to a strongly predictive text stream regardless, but I don’t think that would be me. And the same intuition says it would be possible for MuZero; this is about the expressiveness of text rather than monologue being a key component of cognition or identity. Conversely, I would expect this to go terribly wrong when the tails come apart, because we’re talking about correlates rather than causal structures, with all the usual problems.

• I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I’m less convinced than Eliezer about theorem-proving for AGI safety.

There are so many independently fatal objections that I don’t know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.

Your proof must not include executing the model, nor equivalent computations (e.g. concolic execution). You may train a custom model and/or directly set model weights, so long as it uses a standard EfficientNet architecture and reaches 99% accuracy. Bonus points for proving more of the sensitivity curve.
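To make the target quantity concrete, here’s a sketch in Python. The `model` below is a hypothetical stand-in, not a real EfficientNet; and note that random search like this executes the model, so it can only ever give a lower bound, not a qualifying proof:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((28 * 28, 10)).astype(np.float32)

def model(image: np.ndarray) -> np.ndarray:
    """Stand-in for the real EfficientNet: flatten, affine map, softmax."""
    logits = image.astype(np.float32).ravel() / 255.0 @ W
    z = np.exp(logits - logits.max())
    return z / z.sum()

def lsb_sensitivity(image: np.ndarray) -> float:
    """Largest change in any class probability from flipping the least
    significant bit of one pixel of `image`."""
    base = model(image)
    worst = 0.0
    for idx in np.ndindex(*image.shape):
        flipped = image.copy()
        flipped[idx] ^= 1  # flip the LSB of a single uint8 pixel
        worst = max(worst, float(np.abs(model(flipped) - base).max()))
    return worst

# Search only ever *lower*-bounds the true maximum; the prize asks for a
# proved upper bound, and without executing the model as this does.
images = rng.integers(0, 256, size=(10, 28, 28), dtype=np.uint8)
print(max(lsb_sensitivity(im) for im in images))
```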

I will also take bets that nobody will accomplish this by 2025, nor any loosely equivalent proof for a GPT-3 class model by 2040. This is a very bold claim, but I believe that rigorously proving even trivial global bounds on the behaviour of large learned models will remain infeasible.

And doing this wouldn’t actually help, because a theorem is about the inside of your proof system. Recognising the people in a huge collection of atoms is outside your proof system. Analogue attacks like Rowhammer are not in (the ISA model used by) your proof system—and cache and timing attacks like Spectre probably aren’t yet either. Your proof system isn’t large enough to model the massive floating-point computation inside GPT-2, let alone GPT-3, and if it could, the bounds would be vacuously loose.

I still hope that providing automatic disproof-by-counterexample might, in the long term, nudge ML towards specifiability by making it easy to write and check falsifiable properties of ML systems. On the other hand, hoping that ML switches to a safer paradigm is not the kind of safety story I’d be comfortable relying on.
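As a sketch of what an easy-to-write falsifiable property could look like, via Hypothesis (reusing the hypothetical stand-in `model` from the sketch above; the claimed bound is made up):

```python
import numpy as np
from hypothesis import given, settings, strategies as st
from hypothesis.extra.numpy import arrays

CLAIMED_BOUND = 0.05  # hypothetical: the bound someone claims to have proved

@given(
    image=arrays(np.uint8, (28, 28)),
    row=st.integers(0, 27),
    col=st.integers(0, 27),
)
@settings(max_examples=200)
def test_lsb_flip_within_claimed_bound(image, row, col):
    flipped = image.copy()
    flipped[row, col] ^= 1  # flip the least significant bit of one pixel
    delta = np.abs(model(flipped) - model(image)).max()  # `model` as above
    assert delta <= CLAIMED_BOUND  # any failing example disproves the claim
```

A failing run is a machine-found counterexample; a passing run is of course no proof, which is exactly the asymmetry I mean.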

• The question is whether that decline in effectiveness is because of declining immune response (in which case an identical booster would help) or a shifting distribution of influenza strains—in which case you’d need a different shot. Of course it’s likely to be a mixture, but my understanding is that it’s mostly the latter.

• The (Old Order) Amish are the closest precedent I can think of: very strong maintenance of tradition; internal trade, plus a little external trade tightly constrained by their values; eschewing modern technology; and no external assistance beyond a live-and-let-live approach.