They were not; computers were programmable long before that. Before the 4004, the functionality that we today find on a CPU were distributed over a larger collection of circuitry, with different separate components for the ALU and the instruction interpreter and the register bank and the memory controller and such. But that assembly was already programmable and functioned as a Turing universal computer since the late 40s. The innovation of the intel 4004 was that it was the first design that had all that machinery on a single integrated chip (the first CPU as we might understand it today, in the sense of being the first central processing unit—earlier designs were decentralized, though the term “CPU” was already in use before then).
redlizard
One overriding perspective in Max and my approach is that we need to design our systems so that their safety can be formally verified. In software, people often bring up the halting problem as an argument that general software can’t be verified. But we don’t need to verify general software, we are designing our systems so that they can be verified.
I am a great proponent of proof-carrying code that is designed and annotated for ease of verification as a direction of development. But even from that starry-eyed perspective, the proposals that Andrew argues against here seem wildly unrealistic.
A proof-carrying piece of C code can prove that it is not vulnerable to any buffer overflows, or that it will never run motor 1 and motor 2 simultaneously in opposite directions. A bit more ambitiously, it could contain a full proof of a complete behavioral specification, proving that the software will perform a certain set of behaviors at all times, which as a corollary also implies that it is proof against a large body of security threats. This is not something we can really manage in practice yet, but it’s well within the extrapolated range of current techniques. We could build a compiler that will only compile software that contains a proof matching some minimum necessary specification, too.
Now imagine trying to prove that my program doesn’t perform any sequence of perfectly innocuous interactions that happens to trigger a known or unknown bug in a mysql 5.1.32 server on the internet somewhere. How would you specify that? You can specify that the program doesn’t do anything that might affect anything ever (though this leads to the well-known boxing problems); but if I want my program to have some nonzero ability to affect things in non-malicious ways, how would you specify that it doesn’t do anything that might break something in some other part of the world in some unexpected way, including any unknown zero-days in mysql 5.1.32 servers? Presumably my proof checker doesn’t contain the full code of all other systems on the internet my software might plausibly interact with. How could I specify a requirement like this, let alone prove it?
Proving that a piece of software has the behavior I want, or the behavior I allow, is something that can be done by carefully annotating my code with lemmas and contracts and incremental assumptions that together build up to a proof of the behavior I want. Proving that the software will have the behavior I want no matter what conditions you could possibly throw at it sounds harder but is actually mostly the same problem—and so I would expect that proofs of almost-perfect security of a piece of software would not be that difficult either. But that is security of the software against attacks that might threaten the desired behavior of the software. Demonstrating that the software is not a threat to something else somewhere is another matter entirely, as this requires first knowing and encoding in the proofs all the ways in which the rest of the world might be negatively affected by actions-in-general. Not just superficially, either, if you want the proof to rule out the mysql 5.1.32 zero-day that just uses a sequence of normal interactions that should be perfectly innocent but in practice aren’t. This is proving a negative in the worst possible way; to prove something like this, you would need to quantify over all possible higher-order behaviors the program doesn’t have. I don’t see how any of the real-or-imagined formal verification techniques I have ever heard about could possibly do anything even vaguely like this.
All the above is the world of software versus software, where even the external environment can be known to the last bit, with crisp behavior that can be analyzed and perfectly simulated even to the point where you can reproduce the mysql 5.1.32 bugs. Doing the same thing with biology would be a whole other order of magnitude in challenge level. Proving that a certain piece of synthesized DNA is not going to form a threat to human biology in some indirect way is essentially analogous to the mysql exploitation problem above, but much, much harder. Here, too, you would need to have a proof that quantifies over all possible higher order behaviors the piece of DNA doesn’t have, all applied to the minor inconvenience that is the much murkier world of biology.
Unless I missed some incredible new developments in the field of formal verification, quantification over all possible higher order patterns and consequences of a system being analyzed is something that is absolutely out of reach of any formal verification techniques I have heard of. But if you do know of any techniques that could challenge this barrier, do tell :)
This seems very confused.
What makes good art a subjective quality is that its acceptance criterion is one that refers to the viewer as one of its terms. The
is-good-art()predicate, or theart-quality()real-valued function, has aviewerparameter in it. What makes good physics-theory an objective quality is that its acceptance criterion doesn’t refer to the viewer; theis-good-physics-theory()predicate, or thephysics-theory-accuracy()real-valued function, is one that compares the theory to reality, without the viewer playing a role as a term inside the function.Sure, both of these functions are in the end computed by human brains, which adds a level of subjectivity to the imperfect physical act of actually evaluating these functions. But that doesn’t mean that the ideal things-that-these-brains-are-supposedly-evaluating are themselves subjective. A human brain evaluating how accurate a certain physics theory is results in a subjective assessment of an objective truth; a human brain evaluating whether a certain painting is art results in a subjective assessment of a subjective property. The act of assessment by an imperfect brain adds a layer of subjectivity over something that may or may not be objective in the abstract; but that abstract ideal of that which these brains are supposedly evaluating has a real difference in kind that is well beyond mere consensus.
Modifiers and subjectivity-affecting operations can be applied to both objective and subjective criteria, of course. The degree to which a theory of physics reflects reality is an objective measure; the degree to which a reader likes a theory is a subjective measure. The degree to which a viewer considers a painting to be art is a subjective measure; the degree to which the average human viewer considers a painting to be art is an objective measure, because the
viewerparameter has been aggregated out. But these complications only obscure the basic distinction, they do not fundamentally challenge it.
This was my experience studying in the Netherlands as well. University officials were indeed on board with this, with the general assumption being that lectures and instructions and labs and such are a learning resource that you can use or not use at your discretion.
I would say that some formal proofs are actually impossible
Plausible. In the aftermath of spectre and meltdown I spent a fair amount of time thinking on how you could formally prove a piece of software to be free of information-leaking side channels, even assuming that the same thing holds for all dependent components such as underlying processors and operating systems and the like, and got mostly nowhere.
In fact, I think survival timelines might even require anyone who might be working on classes of software reliability that don’t relate to alignment to actually switch their focus to alignment at this point.
Does that include those working on software correctness and reliability in general, without a security focus? I would expect better tools for making software that is free of bugs, such as programs that include correctness proofs as well as some of the lesser formal methods, to be on the critical path to survival—for the simple reason that any number of mundane programming mistakes in a supposedly-aligned AI could easily kill us all. I was under the impression that you agree with this [“Assurance Requires Formal Proofs”]. I expect formal proofs of security in particular to be largely a corollary of this—a C program that is proven to correctly accomplish any particular goal will necessarily not have any buffer overflows in it, for this would invoke undefined behavior which would make your proof not go through. This does not necessarily apply to all security properties, but I would expect it to apply to most of them.
Yes, I agree with this.
I cannot judge to what degree I agree with your strategic assessment of this technique, though. I interpreted your top-level post as judging that assurances based on formal proofs are realistically out of reach as a practical approach; whereas my own assessment is that making proven-correct [and therefore proven-secure] software a practical reality is a considerably less impossible problem than many other aspects of AI alignment, and indeed one I anticipate to actually happen in a timeline in which aligned AI materializes.
Assurance Requires Formal Proofs, Which Are Provably Impossible
The Halting Problem puts a certain standard of formalism outside our reach
This is really not true. The halting problem only makes it impossible to write a program that can analyze a piece of code and then reliably say “this is secure” or “this is insecure”. It is completely possible to write an analyzer that can say “this is secure” for some inputs, “this is definitely insecure for reason X” for some other inputs, and “I am uncertain about your input so please go improve it” for everything in between. In particular, it is completely possible to have a machine-checkable proof system going along with executable code that can express proofs of extremely strong security properties for almost every program you might wish to run in practice, which can then judge “I can confirm this is secure” or “I can not confirm that this is secure which may or may not indicate an actual problem so go fix it”.
Pulling this off in practice is still fiendishly difficult, of course, and progress in this field has been frustratingly slow. But there is no theoretical reason to suspect that this is fundamentally out of reach. (Or at least, not in the halting problem; Löb’s theorem does provide some real limitations here that are particularly relevant for AI correctness proofs. But that is fairly niche relative to the broader notion of software correctness and security proofs.)
We see this a lot on major events, as I’ve noted before, like the Super Bowl or the World Cup. If you are on the ball, you’ll bet somewhat more on a big event, but you won’t bet that much more on it than on other things that have similarly crazy prices. So the amount of smart money does not scale up that much. Whereas the dumb money, especially the partisans and gamblers, come out of the woodwork and massively scale up.
This sounds like it should generalize to “in big events, especially those on which there are vast numbers of partisans on both sides, the prediction markets will reliably be insane and therefore close to useless for prediction purposes”. Is that something one sees in practice for things like the World Cup?
One way the World Cup and Super Bowl differ from the election is that in a championship match, only a modest fraction of the people interested in the tournament as a whole will be partisans for the two contestants participating in the championship match; whereas in the election, I would expect the partisan coefficient to be much higher than that. Would that affect the degree to which the dumb money will overwhelm the smart money? I would expect so.
I do not think you are selling a strawman, but the notion that a utility function should be computable seems to me to be completely absurd. It seems like a confusion born from not understanding what computability means in practice.
Say I have a computer that will simulate an arbitrary Turing machine T, and will award me one utilon when that machine halts, and do nothing for me until that happens. With some clever cryptocurrency scheme, this is a scenario I could actually build today. My utility function ought plausibly to have a term in it that assigns a positive value to the computer simulating a halting Turing machine, and zero to the computer simulating a non-halting Turing machine. Yet the assumption of utility function computability would rule out this very sensible desire structure.
If I live in a Conway’s Game of Life universe, there may be some chunk of universe somewhere that will eventually end up destroying all life (in the biological sense, not the Game of Life sense) in my universe. I assign lower utility to universes where this is the case, than to those were it is not. Is that computable? No.
More prosaically, as far as I currently understand, the universe we actually live in seems to be continuous in nature, and its state may not be describable even in principle with a finite number of bits. And even if it is, I do not actually know this, which means my utility function is also over potential universes (which, as far as I know, might be the one I live in) that require an infinite amount of state bits. Why in the world would one expect a utility function over an uncountable domain to be computable?
As far as I can see, the motivation for requiring a utility function to be computable is that this would make optimization for said utility function to be a great deal easier. Certainly this is true; there are powerful optimization techniques that apply only to computable utility functions, that an optimizer with an uncomputable utility function does not have access to in their full form. But the utility function is not up for grabs; the fact that life will be easier for me if I want a certain thing, should not be taken as an indication that that is want I want! This seems to me like the cart-before-horse error of trying to interpret the problem as one that is easier to solve, rather than the problem one actually wants solved.
One argument is that U() should be computable because the agent has to be able to use it in computations. If you can’t evaluate U(), how are you supposed to use it? If U() exists as an actual module somewhere in the brain, how is it supposed to be implemented?
This line of thought here illustrates very well the (I claim) grossly mistaken intuition for assuming computability. If you can’t evaluate U() perfectly, then perhaps what your brain is doing is only an approximation of what you really want, and perhaps the same constraint will hold for any greater mind that you can devise. But that does not mean that what your brain is optimizing for is necessarily what it actually wants! There is no requirement at all that your brain is a perfect judge of the desirability of the world it’s looking at, after all (and we know for a fact that it does a far from perfect job at this).
This second claim sounds to me as being a bit trivial. Perhaps it is my reverse engineering background, but I have always taken it for granted that approximately any mechanism is understandable by a clever human given enough effort.
This book [and your review] explains a number of particular pieces of understanding of biological systems in detail, which is super interesting; but the mere point that these things can be understood with sufficient study almost feels axiomatic. Ignorance is in the map, not the territory; there are no confusing phenomena, only minds confused by phenomena; etc. Even when I knew nothing about this biological machinery, I never imagined for a second that no understanding was attainable in principle. I only saw *systems that are not optimized for ease of understanding*, and therefore presumably more challenging to understand than systems designed by human engineers which *are* optimized for ease of understanding.
But I get the impression that the real point you are shooting for (and possibly, the point the book is shooting for) is a stronger point than this. Not so much “there is understanding to be had here, if you look deeply enough”, but rather a claim about what *particular type of structure* we are likely to find, and how this may or may not conform to the type of structure that humans are trained to look for.
Is this true? If it is, could you expand on this distinction?
If you are going to include formal proofs with your AI showing that the code does what it’s supposed to, in the style of Coq and friends, then the characteristics of traditionally unsafe languages are not a deal-breaker. You can totally write provably correct and safe code in C, and you don’t need to restrict yourself to a sharply limited version of the language either. You just need to prove that you are doing something sensible each time you perform a potentially unsafe action, such as accessing memory through a pointer.
This slows things down and adds to your burdens of proof, but not really by that much. It’s a few more invariants you need to carry around with you throughout your proofs. Where in a safer language you may have to prove that your Foo list is still sorted after a certain operation, in C you will additionally have to prove that your pointer topology is still what you want after a certain operation. No big deal. In particular, a mature toolkit for proving properties about C programs will presumably have tools for automating away the 99% of trivial proof obligations involving pointer topology, leaving something for you to prove only when you are doing something clever.
For any such property that you can screw up, a language that will not allow you to screw it up in the first place will make your life easier and your proofs simpler, which is why OCaml is more popular as a vehicle for proven correct code than C. But if you are proving the correctness of every aspect of your program anyway, this is a pretty minor gain; the amount of stuff there is to prove about your object-level algorithms will be vastly greater than the requirements added by the lack of safety of your programming language. If only because there will be specialized tools for rapidly dealing with the latter, but not the former.
Not all those specialized tools exist just yet. Currently, the program correctness proof systems are pretty good at proving properties about functions [in the mathematical sense] operating on data structured in a way that mathematicians like to work with, and a bit rubbish at everything else people use software to do; it is no coincidence that Coq, Agda, Idris, Isabelle, and friends all work primarily with purely functional languages using some form of constructed types as a data model. But techniques for dealing with computing applications in a broader sense will have to be on the proof-technology roadmap sooner or later, if correctness proofs are ever going to fulfill their promise outside selected examples. And when they do, there will not be a big difference between programming in C and programming in OCaml as far as proving correctness is concerned.
tl;dr: I don’t think language safety is going to be more than a rounding error if you want to prove the correctness of a large piece of software down to the last bit, once all the techniques for doing that sort of thing at all are in place. The amount of program-specific logic in need of manual analysis is vastly greater than the amount of boilerplate a safe language can avoid.
The point is: if people understood how their bicycle worked, they’d be able to draw one even without having to literally have one in front of them as they drew it!
I don’t think this is actually true. Turning a conceptual understanding into an accurate drawing is a nontrivial skill. It requires substantial spatial visualization ability, as well as quite a bit of drawing skill—one who is not very skilled in drawing, like myself, might poorly draw one part of a bike, want to add two components to it, and then realize that there is no way to add a third component to the poor drawing without turning it into an illegible mess of ink. There is a reason technical drawing is an explicit course in engineering education.
I built a nontrivial construction yesterday, that I understand in great detail and personally designed in OpenSCAD beforehand, that I could not put on paper by hand in a way that is vaguely mechanically accurate, without a visual reference (be it the actual construction or the CAD model). At least, not in one try—I might manage if it I threw away the first three sketches.
Even if such a person decides to do this, they will eventually get fed up and leave.
Will they, necessarily? The structure of the problem you describe sounds a lot like any sort of teaching, which involves a lot of finding out what a student misunderstands about a particular topic and then fixing that, even if you clear up that same misunderstanding for a different student every week. There are lots of people who do not get fed up with that. What makes this so different?
Pigeons have stable, transitive hierarchies of flight leadership, and they have stable pecking order hierarchies, and these hierarchies do not correlate.
one of the things you can do with the power to give instructions is to instruct others to give you more goodies.
It occurs to me that leading a flight is an unusual instruction-giving power, in that it comes with almost zero opportunities to divert resources in your own direction. Choosing where to fly and when to land affects food options, but it does not affect your food options relative to your flight-mates. Most leadership jobs give many more opportunities to turn the position into zero-sum personal benefits.
I suspect this is not a coincidence. Can anyone think of a case where the pecking order and the leadership hierarchy are uncorrelated in a situation where the leadership is exploitable for pecking opportunities?
General rationality question that should not be taken to reflect any particular opinion of mine on the topic at hand:
At what point should “we can’t find any knowledgeable critics offering meaningful criticism against <position>” be interpreted as substantial evidence in favor of <position>, and prompt one to update accordingly?
Having lost this signaling tool, we are that much poorer.
Are we? Signaling value is both a blessing and a curse, and my impression is that it is generally zero-sum. Personally, I consider myself *richer* when a mundane activity or lifestyle choice loses its signaling association, for it means I am now less restricted in applying it.
At the time of writing, for the two spoilers in the main post, hovering over either will reveal both. Is that intentional? It does not seem desirable.
I think there is about a three orders of magnitude difference between the difficulties of “inventing calculus where there was none before” and “learning calculus from a textbook explanation carefully laid out in the optimal order, with each component polished over the centuries to the easiest possible explanation, with all the barriers to understanding carefully paved over to construct the smoothest explanatory trajectory possible”.
(Yes, “three orders of magnitude” is an actual attempt to estimate something, insofar as that is at all meaningful for an unquantified gut instinct; it’s not just something I said for rhetoric effect.)
I think it will be next to impossible to set up a community norm around this issue for all communities save those with a superhuman level of general honesty. For if there is a norm like this in place, Alice always has a strong incentive to pretend that she is punching based on some generally accepted theory, and that the only thing that needs arguing is the application of this theory to Bob (point 2). Even when there is in fact a new piece of theory ready to be factored out of Alice’s argument, it is in Alice’s interest to pass this off as being a straightforward application of some more general principle rather than anything new, and she will almost certainly be able to convincingly pass it off as such.
As soon as there is a community norm around building your new punching theories separately from any actual punches, anyone who can argue that their justification for punching Bob doesn’t need any new theory, that is, that the justification follows trivially from accepted ideas, will have a that much stronger position. Thus, only the most scrupulous of punchers will ever actually implement step (1), and the norm will collapse.
While I’m not disputing the substance of what you are saying here (besides the 4004 timeline), from a computer science perspective I am a bit annoyed at the terminology. A machine that can load computational instructions from a storage medium would traditionally be called a programmable computer, whereas the system you describe “a machine that runs an algorithm” is just precisely a computer. I understand that this is not a nuance represented in more popular terminology, but I feel an article that is precise about the difference could benefit from also using the more precise terminology.