Full disclosure: I’m a professional cryptography research assistant. I’m not really interested in AI (yet), but there are obvious similarities when it comes to security.
I have to back Eliezer up on the “Lots of strawmanning” part. No professional cryptographer will ever tell you there’s hope of achieving a “perfect level of safety” in anything, and cryptography, unlike AI, is a very well-formalized field. As an example, I’ll offer a conversation with a student:
How secure is this system? (Such a question is usually shorthand for: “What’s the probability this system won’t be broken by methods X, Y and Z?”)
The theorem says […]
What’s the probability that the proof of the theorem is correct?
… probably not
Now, before you go “yeah, right”, I’ll also say that I’ve already seen this once: there was a theorem in a major peer-reviewed journal that turned out to be wrong (a counter-example was found) after one of the students tried to implement it as part of his thesis. So the probability was indeed not even close to […] for any serious N. I’d like to point out that this doesn’t even include problems with the implementation of the theory.
It’s really difficult to explain how hard this stuff is to people who have never tried to develop anything like it. That’s too bad (and a danger), because the people who do get it are rarely in charge of the money. That’s one reason for the CFAR/rationality movement… you need a tool to explain it to other people too, am I right?
Yup. Usual reference: “Probing the Improbable: Methodological Challenges for Risks with Low Probabilities and High Stakes”. (I also have an essay on a similar topic.)
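To put rough numbers on the point about proofs, here is a minimal Python sketch of why the chance of a flawed proof, rather than the bound the theorem states, can dominate the real risk. It applies the law of total probability over “the proof is sound” versus “the proof is flawed”. Every number below, and the name `break_probability`, is a hypothetical chosen for illustration; none of them comes from the thread.

```python
def break_probability(p_break_if_proof_holds, p_proof_wrong, p_break_if_proof_wrong):
    """Law of total probability over 'proof is sound' vs 'proof is flawed'."""
    p_proof_holds = 1.0 - p_proof_wrong
    return (p_break_if_proof_holds * p_proof_holds
            + p_break_if_proof_wrong * p_proof_wrong)

# Hypothetical inputs, for illustration only.
theorem_bound = 2.0 ** -80   # break probability promised by the theorem
p_proof_wrong = 1e-3         # assumed chance that the published proof is flawed
p_break_if_wrong = 0.5       # assumed chance of a break when the proof fails

overall = break_probability(theorem_bound, p_proof_wrong, p_break_if_wrong)
print(f"theorem bound: {theorem_bound:.3g}, overall estimate: {overall:.3g}")
```

With these made-up inputs the overall estimate is about 0.0005, set almost entirely by the assumed chance of a flawed proof; the theorem's own bound barely matters.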
Upvoted for being gwern i.e. having a reference for everything… how do you do that?
Excellent visual memory, great Google & search skills, a thorough archive system, thousands of excerpts stored in Evernote, and essays compiling everything relevant I know of on a topic—that’s how.
(If I’d been born decades ago, I’d probably have become a research librarian.)
Would love to read a gwern-essay on your archiving system. I use evernote, org-mode, diigo and pocket and just can’t get them streamlined into a nice workflow. If evernote adopted diigo-like highlighting and let me seamlessly edit with Emacs/org-mode that would be perfect… but alas until then I’m stuck with this mess of a kludge. Teach us master, please!
I meant http://www.gwern.net/Archiving%20URLs
Of course you already have an answer. Thanks!
Why do you use diigo and pocket? They do the same thing. Also, with Evernote’s Clearly you can highlight articles.
You weren’t asking me, but I use diigo to manage links to online textbooks and tutorials, shopping items, book recommendations (through amazon), and my to-read list of less important online articles. Evernote is for saving all of my important read content (and I tag everything). I use Amazon’s send to kindle extension to read longer articles (every once in a while I’ll save all my clippings from my kindle to evernote). And then I maintain a personal wiki and collection of writings using markdown with evernote’s import folder function in the pc software (I could also do this with a cloud service like gdrive).
I used diigo for annotation before Clearly had highlighting. Now, just like you, I use diigo for link storage and Evernote for content storage. Diigo annotation still has the advantage that it excerpts the text you highlight. With Clearly, if I want to have the highlighted parts I have to find and manually select them again… Also, tagging from Clearly requires 5 or so clicks, which is ridiculous… But I hope it will get fixed.
I plan to use pocket once I get a tablet… it is pretty and convenient, but the most likely to get cut out of the workflow.
Thanks for the evernote import function. I’ll look into it; maybe it could make the Evernote/org-mode integration tighter. Even then, having 3 separate systems is not quite optimal...
Thanks, I’ve read those. Good article.
So, what is our backup plan when proofs turn out to be wrong?
The usual disjunctive strategy: many levels of security, so an error in one is not a failure of the overall system.
What kind of “levels of security” do you have in mind? Can they guard against an error like “we subtly messed up the FAI’s decision theory or utility function, and now we’re stuck with getting 1⁄10 of the utility out of the universe that we might have gotten”?
Boxing is an example of a level of security: the wrong actions can trigger some invariant and signal that something went wrong with the decision theory or utility function. I’m sure security could be added to the utility function as well: maybe some sort of conservatism along the lines of the suicide-button invariance, where it leaves the Earth alone and so we get a lower bound on how disastrous a mistake can be. Lots of possible precautions and layers, each of which can be flawed (like Eliezer has demonstrated for boxing) but hopefully are better than any one alone.
That’s not ‘boxing’. Boxing is a human pitting their wits against a potentially hostile transhuman over a text channel and it is stupid. What you’re describing is some case where we think that even after ‘proving’ some set of invariants, we can still describe a high-level behavior X such that detecting X either indicates global failure with high-enough probability that we would want to shut down the AI after detecting any of many possible things in the reference class of X, or alternatively, we think that X has a probability of flagging failure and that we afterward stand a chance of doing a trace-back to determine more precisely if something is wrong. Having X stay in place as code after the AI self-modifies will require solving a hard open problem in FAI for having a nontrivially structured utility function such that X looks like an instrumentally good thing (your utility function must yield, ‘under circumstances X, it is better that I be suspended and examined than that I continue to do whatever I would otherwise calculate as the instrumentally right thing’). This is how you would describe on a higher level of abstraction an attempt to write a tripwire that immediately detects an attempt to search out a strategy for deceiving the programmers as the goal is formed and before the strategy is actually searched.
There’s another class of things Y where we think that humans should monitor surface indicators because a human might flag something that we can’t yet reify as code, and this potentially indicates a halt-melt-and-catch-fire-worthy problem. This is how you would describe on a higher level of abstraction the ‘Last Judge’ concept from the original CEV essay.
All of these things have fundamental limitations in terms of our ability to describe X and monitor Y; they are fallback strategies rather than core strategies. If you have a core strategy that can work throughout, these things can flag exceptions indicating that your core strategy is fundamentally not working and you need to give up on that entire strategy. Their actual impact on safety is that they give a chance of detecting an unsafe approach early enough that you can still give up on it. Meddling dabblers invariably want to follow a strategy of detecting such problems, correcting them, and then saying afterward that the AI is back on track, which is one of those suicidal approaches that they think might have an 80% chance of working, or whatever.
That was how you did your boxing experiments, but I’ve never taken it to be so arbitrarily limited in goals, capacities, or strategies on either end. There is no reason you cannot put the AI in a box with some triggers for it venturing into dangerous territory, and this would be merely sane for anyone doing such a thing.
Be specific? What sort of triggers, what sort of dangerous territory? I can’t tell if you’re still relying on a human to outwit a transhuman or talking about something entirely different.
Depends on what you mean by “outwitting”. A human outwitting a transhuman by persuading it to solve problems for them, without letting it do whatever it wants, is different to a human outwitting a transhuman by creating a security system without any holes the transhuman can find.
A trans-human intelligence ought to be able to model a human one with ease. This means being able to predict potential triggers and being able to predict how to trick the lack-wit humans on the other end into unwittingly revealing the location of the triggers (even if they don’t consciously know it themselves). So the only trigger that matters is one to detect a hint of an intent to get out. Even that is probably too naive, as there could well be other failure modes of which AI deboxing is but a side effect, and our limited human imagination is never going to catch them all. My expectation is that if you rely on safety triggers to bail you out (instead of including them as a desperate last-ditch pray-it-works defense), then you might as well not bother with boxing at all.
My whole point about ‘defense in depth’ was that each layer is highly fallible and could have errors. Your expectation only holds if you expect failures to be perfectly correlated, or if multiple layers actually reduce the strength of individual layers; otherwise the probability of the AI beating both layers A and B is necessarily less than that of beating just A or B (P(A ∧ B) < P(A ∨ B)).
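The arithmetic behind this claim can be sketched in Python under two extreme assumptions: layer failures that are fully independent, and failures that are perfectly correlated (one exploit defeats every layer). The function names and the failure probabilities below are hypothetical, chosen only to illustrate the comparison.

```python
from math import prod

def p_breach_independent(layer_failure_probs):
    """Probability that every layer fails, assuming independent failures."""
    return prod(layer_failure_probs)

def p_breach_perfectly_correlated(layer_failure_probs):
    """Upper bound on joint failure: the stack is no better than its strongest layer."""
    return min(layer_failure_probs)

# Hypothetical per-layer failure probabilities, for illustration only.
layers = [0.5, 0.2, 0.1]
independent = p_breach_independent(layers)
correlated = p_breach_perfectly_correlated(layers)
print(f"independent: {independent:.3f}, perfectly correlated: {correlated:.3f}")
```

Under independence, stacking layers multiplies the failure probabilities (0.01 here); under perfect correlation, the stack is only as good as its best layer (0.1 here). How close a real system sits to either extreme is exactly what is being disputed.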
That’s true. However I would expect a transhuman to be able to find a single point of failure which does not even occur to our limited minds, so this perfect correlation is a virtual certainty.
Now you’re just ascribing magical powers to a potentially-transhuman AI. I’m sure there exists such a silver bullet, in fact by definition if security isn’t 100%, that’s just another way of saying there exists a strategy which will work; but that’s ignoring the point about layers of security not being completely redundant with proofs and utility functions and decision theories, and adding some amount of safety.
Disengaging.
As I understand EY’s point, it’s that (a) the safety provided by any combination of defenses A, B, C, etc. around an unboundedly self-optimizing system with poorly architected goals will be less than the safety provided by such a system with well architected goals, and that (b) the safety provided by any combination of defenses A, B, C, etc. around such a system with poorly architected goals is too low to justify constructing such a system, but that (c) the safety provided by such a system with well architected goals is high enough to justify constructing such a system.
That the safety provided by a combination of defenses A, B, C is greater than that provided by A alone is certainly true, but seems entirely beside his point.
(For my own part, a and b seem pretty plausible to me, though I’m convinced of neither c nor that we can construct such a system in the first place.)
That is how they build prisons. It is also how they construct test harnesses. It seems as though using machines to help with security is both obvious and prudent.
Agreed. Maybe I missed it, but I haven’t seen you write much on the value of fallback strategies, even on the understanding that it’s small (much less than that of FAI theory).
There’s a little in CFAI sec.5.8.0.4, but not much more.
I understood “boxing” to refer to any attempt to keep an SI in a box while somehow still extracting useful work from it; whether said work is in the form of text strings or factory settings doesn’t seem relevant.
Your central point is valid, of course.
I don’t see how to make this work. Do we make the AI indifferent about Earth? If so, Earth will be destroyed as a side effect of its other actions. Do we make it block all causal interactions between Earth and the rest of the universe? Then we’ll be permanently stuck on Earth even if the FAI attempt turns out to be successful in other regards. Any other ideas?
I had a similar qualm about the suicide button
Nothing comes for free.
Yes, it is this layered approach that the OP is asking about—I don’t see that SI is trying it.
In what way would SI be ‘trying it’? The point about multiple layers of security being a good idea for any seed AI project has been made at least as far back as Eliezer’s CFAI and brought up periodically since with innovations like the suicide button and homomorphic encryption.
I agree: that sort of innovation can be researched as additional layers to supplement FAI theory.
Our question was: to what extent should SI invest in this sort of thing?
My own view is ‘not much’, unless SI were to launch an actual ‘let’s write AGI now’ project, in which case they should invest as heavily as anyone else would who appreciated the danger.
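Many of the layers are standard computer security topics, and the more exotic layers like homomorphic encryption are being handled by academia & industry adequately (and it would be very difficult for SI to find cryptographers who could advance the state of the art); hence, SI’s ‘comparative advantage’, as it were, currently seems to be in the most exotic areas like decision theory & utility functions. So I would agree with the OP summary:
“Perhaps the folks who are actually building their own heuristic AGIs are in a better position than SI to develop safety mechanisms for them, while SI is the only organization which is really working on a formal theory on Friendliness, and so should concentrate on that. It could be better to focus SI’s resources on areas in which it has a relative advantage, or which have a greater expected impact.”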
Although I would amend ‘heuristic AGIs’ to be more general than that.
That’s all the more reason to publish some articles on how to apply known computer security techniques to AGI. This is way easier (though far less valuable) than FAI, but not obvious enough to go unsaid.
Yes. But then again, don’t forget the 80⁄20 rule. There may be some low-hanging fruit along other lines than FAI—and for now, no one else is doing it.
Sure, we agree that the “100% safe” mechanisms are not 100% safe, and SI knows that.
So how do we deal with this very real danger?
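The point is that you never achieve 100% safety no matter what, so the correct approach is to reduce risk as much as possible given whatever resources you have. This is exactly what Eliezer says SI is doing:
“I have an analysis of the problem which says that if I want something to have a failure probability less than 1, I have to do certain things because I haven’t yet thought of any way not to have to do them.”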
IOW, they thought about it and concluded there’s no other way. Is their approach the best possible one? I don’t know, probably not. But it’s a lot better than “let’s just build something and hope for the best”.
Edit: Is that analysis public? I’d be interested in that, probably many people would.
I’m not suggesting “let’s just build something and hope for the best.” Rather, we should pursue a few strategies at once: both FAI theory and stopgap security measures, as well as education of other researchers.
I really appreciate this comment because safety in cryptography (and computer security in general) is probably the closest analog to safety in AI that I can think of. Cryptographers can only protect against known attacks while hoping that adding a few more rounds to a cipher will also protect against the next few attacks that are developed. Physical attacks are often just as dangerous as theoretical attacks. When a cryptographic primitive is broken it’s game over; there’s no arguing with the machine or with the attackers or papering a solution over the problem. When the keys are exposed, it’s game over. You don’t get second chances.
So far I haven’t seen an analysis of the hardware aspect of FAI on this site. It isn’t sufficient for FAI to have a logical self-reflective model of itself and its goals. It also needs an accurate physical model of itself and how that physical nature implements its algorithms and goals. It’s no good if an FAI discovers that by aiming a suitably powerful source of radiation at a piece of non-human hardware in the real world it is able to instantly maximize its utility function. It’s no good if a bit flip in its RAM makes it start maximizing paperclips instead of CEV. Even if we had a formally proven model of FAI that we were convinced would work I think we’d be fools to actually start running it on the commodity hardware we have today. I think it’s probably a simpler engineering problem to ensure that the hardware is more reliable than the software, but something going seriously wrong in the hardware over the lifetime of the FAI would be an existential risk once it’s running.
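As a small illustration of the bit-flip concern, here is a Python sketch of one classic hardware-style mitigation: keeping three copies of a stored value and taking a bitwise majority vote. This is only a toy model of the idea, not something proposed in the thread; the names `majority_vote` and `stored` and the sample value are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)

# Hypothetical stored value, kept in three redundant copies.
stored = 0b10110010
copies = [stored, stored, stored]

# Simulate a single bit flip in one copy.
copies[1] ^= 1 << 5

recovered = majority_vote(*copies)
print(recovered == stored)
```

The vote masks any single corrupted copy, but it fails if two copies are corrupted at the same bit position, so it lowers the risk of hardware error rather than eliminating it.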