The main way I could see agent foundations research as helping to address these problems, rather than merely deferring them, is if we plan to eschew large-scale ML altogether.
As I understand it, the default Nate prediction is that if we get aligned AGI at all, it’s most likely to have a mix of garden-variety narrow-AI ML with things that don’t look like contemporary ML. I wouldn’t describe that as “eschewing large-scale ML altogether”, but possibly Paul would.
I think the more important disagreement here isn’t about how hard it is to use AF to resolve the central difficulties, but rather about how hard it is to resolve the central difficulties with the circa-2018 ML toolbox. Eliezer’s view, from the Sam Harris interview, is:
The depth of the iceberg is: “How do you actually get a sufficiently advanced AI to do anything at all?” Our current methods for getting AIs to do anything at all do not seem to me to scale to general intelligence. If you look at humans, for example: if you were to analogize natural selection to gradient descent, the current big-deal machine learning training technique, then the loss function used to guide that gradient descent is “inclusive genetic fitness”—spread as many copies of your genes as possible. We have no explicit goal for this. In general, when you take something like gradient descent or natural selection and take a big complicated system like a human or a sufficiently complicated neural net architecture, and optimize it so hard for doing X that it turns into a general intelligence that does X, this general intelligence has no explicit goal of doing X.
We have no explicit goal of doing fitness maximization. We have hundreds of different little goals. None of them are the thing that natural selection was hill-climbing us to do. I think that the same basic thing holds true of any way of producing general intelligence that looks like anything we’re currently doing in AI.
If you get it to play Go, it will play Go; but AlphaZero is not reflecting on itself, it’s not learning things, it doesn’t have a general model of the world, it’s not operating in new contexts and making new contexts for itself to be in. It’s not smarter than the people optimizing it, or smarter than the internal processes optimizing it. Our current methods of alignment do not scale, and I think that all of the actual technical difficulty that is actually going to shoot down these projects and actually kill us is contained in getting the whole thing to work at all. Even if all you are trying to do is end up with two identical strawberries on a plate without destroying the universe, I think that’s already 90% of the work, if not 99%.
My understanding is that Paul thinks breaking the evolution analogy is important, but a lot less difficult than Eliezer thinks it is.
I like this post, though I wish it were explicit about the fact that the subject matter is really “relatively intractable disagreements between smart, well-informed people about Big Issues”, not “arguments” in general.
If everyone in the discussion is smart and well-informed, and the subject is a Big Issue, then trying to resolve the issue by bringing up an isolated fact tends to be a worse use of time, or is further away from people’s cruxes, than trying to survey all the evidence, which tends to be worse / less cruxy than delving into high-level generators. But:
A lot of arguments aren’t about Big Issues. One example I’ve seen: Alice and Bob disagreed about whether a politician had made an inflammatory gesture, based on contradictory news reports. Alice tracked down a recording and showed it to Bob, while noting a more plausible explanation for the gesture; this convinced Bob, even though it was a mere fact and not a literature review or philosophical treatise.
A lot of Big-Issue-adjacent arguments aren’t that sophisticated. If you read Scott’s post and then go argue with someone who says “evolution is just a theory”, it will often be the case that the disagreement is best resolved by just clarifying definitions, not by going hunting for deep generators.
An obvious reply is “well, those arguments are bad in their own right; select arguments and people-to-argue-with such that Scott’s pyramid is true, and you’ll be much better off”. I tentatively think that’s not the right approach, even though I agree that the examples I cited aren’t good topics for rationalists to spend time on. Mostly I just think it’s not true that smart people never believe really consequential, large-scale things for trivial-to-refute reasons. Top rationalists don’t know everything, so some of their beliefs will be persistently wrong just because they misunderstood a certain term, never happened to encounter a certain isolated fact, are misremembering the results from a certain study, etc. That can lead to long arguments when the blind spot is hard to spot.
If people neglect mentioning isolated facts or studies (or clarifying definitions) because they feel like it would be lowly or disrespectful, they may just end up wasting time. And I worry that people’s response to losing an argument is often to rationalize some other grounds for their original belief, in which case Scott’s taxonomy can encourage people to mis-identify their own cruxes as being deeper and more intractable than they really are. This is already a temptation because it’s embarrassing to admit that a policy or belief you were leaning on hard was so simple to refute.
(Possibly I don’t have a substantive disagreement with Scott and I just don’t like how many different dimensions of value the pyramid is collapsing. There’s a sense in which arguments toward the top can be particularly valuable, but people who like the pyramid shouldn’t skip over the necessary legwork at the lower levels.)
It’s not wrong, but it’s not locally valid. Here again, I’m going for that sweet irony.
If local validity meant never sharing your confidence levels without providing all your evidence for your beliefs, local validity would be a bad desideratum.
I could trust EY to be right, but personally I don’t. Therefore, EY’s post didn’t really force me to update my estimate of P(“memetic collapse”) in either direction.
Yes. I think that this is a completely normal state of affairs, and if it doesn’t happen very often then there’s probably something very wrong with the community’s health and epistemic hygiene:
Person A makes a claim they don’t have time to back up.
Person B trusts A’s judgment enough to update nontrivially in the direction of the claim. B says as much, but perhaps expresses an interest in hearing the arguments in more detail (e.g., to see if it makes them update further, or out of intellectual curiosity, or to develop a model with more working parts, or to do a spot check on whether they’re correct to trust A that much).
Person C doesn’t trust A’s (or, implicitly, B’s) judgment enough to make a nontrivial update toward the claim. C says as much, and expresses an interest in hearing the arguments in more detail so they can update on the merits directly (and e.g. learn more about A’s reliability).
This situation is a sign of a healthy community (though not a strong sign). There’s no realistic way for everyone to have the same judgments about everyone else’s epistemic reliability — this is another case where it’s just too time-consuming for everyone to fully share all their evidence, though they can do some information-sharing here and there (and it’s particularly valuable to do so with people like Eliezer who get cited so much) — so this should be the normal way of things.
I’m not just saying that B and C’s conduct in this hypothetical is healthy; I think A’s is healthy too, because I don’t think people should hide their conclusions just because they can’t always concisely communicate their premises.
Like I said earlier, I’m sympathetic to the idea that Eliezer should explicitly highlight “this is a point I haven’t defended” in cases like this. I’ve said that I think your criticisms have been inconsistent, unclear, or equivocation-prone on a lot of points, and that I think you’ve been failing a lot on other people’s ITTs here; but I continue to fully endorse your interjection of “I disagree with A on this point” (both as a belief a reasonable person can hold, and as a positive thing for people to express given that they hold it), and I also continue to think that doing more signposting of “I haven’t defended this here” may be a good idea. I’d like to see it discussed more.
You have said this a lot, but I don’t really see why it should be true.
It’s just a really common state of affairs, maybe even the default when you’re talking about most practically important temporal properties of human individuals and groups. Compare claims like “top evopsych journals tend to be more careful and rigorous than top nutrition science journals” or “4th-century AD Roman literature used less complex wordplay and chained literary associations than 1st-century AD Roman literature”.
These are the kinds of claims where it’s certainly possible to reach a confident conclusion if (as it happens) the effect size is large, but where there will be plenty of finicky details and counter-examples and compressing the evidence into an easy-to-communicate form is a pretty large project. A skeptical interlocutor in those cases could reasonably doubt the claim until they see a lot of the same evidence (while acknowledging that other people may indeed have access to sufficient evidence to justify the conclusion).
(Maybe the memetic collapse claim, at the effect size we’re probably talking about, is just a much harder thing to eyeball than those sorts of claims, such that it’s reasonable to demand extraordinary evidence before you think that human brains can reach correct nontrivial conclusions about things like memetic collapse at all. I think that sort of skepticism has some merit to it, and it’s a factor going into my skepticism; I just don’t think the particular arguments you’ve given make sense as factors.)
In a community where we try to assign status/esteem/respect based on epistemics, there’s always some risk that it will be hard to notice evidence of ingroup bias because we’ll so often be able to say “I’m not biased; I’m just correctly using evidence about track records to determine whose views to put more weight on”. I could see an argument for having more of a presumption of bias in order to correct for the fact that our culture makes it hard to spot particular instances of bias when they do occur. On the other hand, being too trigger-happy to yell “bias!” without concrete evidence can cause a lot of pointless arguments, and it’s easy to end up miscalibrated in the end.
I’d also want to explicitly warn against confusing epistemic motivations with ‘I want to make this social heuristic cheater-resistant’ motivations, since I think this is a common problem. Highly general arguments against the existence of hard-to-transmit evidence (or conflation of ‘has the claimant transmitted their evidence?’ with ‘is the claimant’s view reasonable?’) raise a lot of alarm bells for me in line with Status Regulation and Anxious Underconfidence and Hero Licensing.
I’m suggesting that he (Hypothesis) is making an argument that’s almost reasonable, but that he probably wouldn’t accept if the same argument was used to defend a statement he didn’t agree with (or if the statement was made by someone of lower status than EY).
This kind of claim is plausible on priors, but I don’t think you’ve provided Bayesian evidence in this case that actually discriminates pathological ingroup deference from healthy garden-variety deference. “You’re putting more stock in a claim because you agree with other things the claimant has said” isn’t in itself doing epistemics wrong.
In a community where we try to assign status/esteem/respect based on epistemics, there’s always some risk that it will be hard to notice evidence of ingroup bias because we’ll so often be able to say “I’m not biased; I’m just correctly using evidence about track records to determine whose views to put more weight on”. I could see an argument for having more of a presumption of bias in order to correct for the fact that our culture makes it hard to spot particular instances of bias when they do occur. On the other hand, being too trigger-happy to yell “bias!” without concrete evidence can cause a lot of pointless arguments, and it’s easy to end up miscalibrated in the end; the goal is to end up with accurate beliefs about the particular error rate of different epistemic processes, rather than to play Bias Bingo for its own sake.
So on the whole I still think it’s best to focus discussion on evidence that actually helps us discriminate the level of bias, even if it takes some extra work to find that evidence. At least, I endorse that for public conversations targeting specific individuals; making new top-level posts about the problem that speak in generalities doesn’t run into the same issues, and I think private messaging also has less of the pointless-arguments problem.
It might be true that EY’s claim is very hard to prove with any rigor, but that is not a reason to accept it.
Obviously not; but “if someone had a justified true belief in this claim, it would probably be hard to transmit the justification in a blog-post-sized argument” does block the inferences “no one’s written a convincing short argument for this claim, therefore it’s false” and “no one’s written a convincing short argument for this claim, therefore no one has justified belief in it”. That’s what I was saying earlier, not “it must be true because it hasn’t been proven”.
The text of EY’s post suggests that he is quite confident in his belief, but if he has no strong arguments (and especially if no strong arguments can exist), then his confidence is itself an error.
You’re conflating “the evidence is hard to transmit” with “no evidence exists”. The latter justifies the inference to “therefore confidence is unreasonable”, but the former doesn’t, and the former is what we’ve been talking about.
I think we can all agree that “sometimes hypotheses that are hard to prove rigorously happen to be true anyway” is a complete cop-out. Because sometimes hard-to-prove hypotheses also happen to be false.
It’s not a cop-out to say “evidence for this kind of claim can take a while to transmit” in response to “since you haven’t transmitted strong evidence, doesn’t that mean that your confidence is ipso facto unwarranted?”. It would be an error to say “evidence for this kind of claim can take a while to transmit, therefore the claim is true”, but no one’s said that.
I think you’re probably in a really bad state if you have to lean very much on that with your first AGI system. You want to build the system to not optimize any harder than absolutely necessary, but you also want the system to fail safely if it does optimize a lot harder than you were expecting.
The kind of AGI approach that seems qualitatively like “oh, this could actually work” to me involves more “the system won’t even try to run searches for solutions to problems you don’t want solved” and less “the system tries to find those solutions but fails because of roadblocks you put in the way (e.g., you didn’t give it enough hardware)”.
Just saw this old thread thanks to the spammers! Updated https://intelligence.org/research-guide/ based on this suggestion.
From Scott Aaronson today:
De Grey constructs an explicit graph with unit distances—originally with 1567 vertices, now with 1585 vertices after a bug was fixed—and then verifies by computer search (which takes a few hours) that 5 colors are needed for it. So, can we be confident that the proof will stand—i.e., that there are no further bugs? See the comments of Gil Kalai’s post for discussion. Briefly, though, it looks like it’s now been independently verified, using different SAT-solvers, that the chromatic number of de Grey’s corrected graph is indeed 5, including by my good friend Marijn Heule at UT Austin. If and when it’s also mechanically checked that the graph is unit distance (i.e., that it can be embedded in the plane with distances of 1), I think it will be time to declare this result correct. Update: De Grey emailed to tell me that this part has now been independently verified as well. I’ll link to details as soon as I have them.
And de Grey commented:
Yesterday I fixed the bug that led me to believe I had a 1567-vertex solution and reran my deleteability-seeking code overnight; the result is that I now have a 1581-vertex graph (i.e., four vertices can be removed from the graph that various people verified yesterday) and I have stuck a revised manuscript on the arxiv which should go live later today.
Well. That’s really, really, really crazy.
Also, from Noam Elkies on Math Overflow:
It seems that the current status is that the 1567-point graph is 4-colorable (and there was a bug in de Grey’s code), but it was obtained by removing a few too many vertices from a 1585-point graph that’s not 4-colorable (and this has now been [independently] checked by a SAT solver). So we still have a new lower bound of 5 on the chromatic number of the plane; but I need to edit my post because the target is no longer 1567.
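The computation being re-run by those independent checkers is an instance of the k-colorability decision problem. At 1585 vertices it genuinely requires a SAT solver, but the problem itself is simple to state; here’s a toy backtracking sketch (my own illustration, not de Grey’s or Heule’s actual tooling):

```python
def is_k_colorable(n, edges, k):
    """Backtracking check: can an n-vertex graph be properly colored with k colors?"""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    colors = [-1] * n  # -1 means "not yet colored"

    def assign(v):
        if v == n:
            return True  # every vertex got a color with no conflicts
        for c in range(k):
            if all(colors[u] != c for u in adj[v]):
                colors[v] = c
                if assign(v + 1):
                    return True
        colors[v] = -1  # undo and backtrack
        return False

    return assign(0)

# An odd cycle is the smallest non-trivial example: it needs 3 colors.
C5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(is_k_colorable(5, C5, 2))  # → False
print(is_k_colorable(5, C5, 3))  # → True
```

For a graph the size of de Grey’s, the standard move is instead to encode “vertex v gets color c” as Boolean variables, add clauses forbidding same-colored adjacent vertices, and hand the formula to a SAT solver; getting UNSAT for k = 4 is what establishes the lower bound of 5.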
However the correct response is not to take the single data point provided more charitably.
You’re conflating two senses of “take a single data point charitably”: (a) “treat the data point as relatively strong evidence for a hypothesis”, and (b) “treat the author as having a relatively benign reason to cite the data point even though it’s weak”. The first is obviously bad (since we’re assuming the data is weak evidence), but you aren’t claiming I did the first thing. The second is more like what I actually said, but it’s not problematic (assuming I have a good estimate of the citer’s epistemics).
“Charity” framings are also confusingly imprecise in their own right, since like “steelmanning,” they naturally encourage people to equivocate between “I’m trying to get a more accurate read on you by adopting a more positive interpretation” and “I’m trying to be nice/polite to you by adopting a more positive interpretation”.
The correct response is to accept that this claim will never have high certainty.
A simple counterexample is “I assign 40:1 odds that my friend Bob has personality trait [blah],” where a lifetime of interactions with Bob can let you accumulate that much confidence without it being easy for you to compress the evidence into an elevator pitch that will push strangers to similar levels of confidence. (Unless the stranger simply defers to your judgment, which is different from them having access to your evidence.)
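As a toy illustration of how that kind of confidence can accumulate: under an independence assumption, odds update multiplicatively, so many individually weak observations compound to something near 40:1 even though no single observation is communicable as strong evidence. (The 1.2:1 likelihood ratio below is invented for illustration, not a claim about any actual Bob.)

```python
import math

def posterior_odds(prior_odds, likelihood_ratios):
    """Multiply prior odds by each likelihood ratio (assumes the observations
    are independent given the hypothesis)."""
    return prior_odds * math.prod(likelihood_ratios)

# Twenty mildly informative interactions, each only 1.2:1 in favor:
odds = posterior_odds(1.0, [1.2] * 20)
print(round(odds, 1))  # → 38.3, i.e. roughly 40:1
```

Each individual 1.2:1 observation would barely move a stranger; the stranger would need the whole lifetime of data (or deference) to reach the same posterior.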
This post is hardly a “short amount of text”
I think a satisfactory discussion of the memetic collapse claim would probably have to be a lot longer, and a lot of it would just be talking about more data points and considering different interpretations of them.
I think the criticism “isolated data points can cause people to over-update when they’re presented in vivid, concrete terms” makes sense, and this is a big part of why it’s pragmatically valuable to push back against “one hot day ergo climate change”, because even though it’s nonzero Bayesian evidence for climate change, the strength of evidence is way weaker than the emotional persuasiveness. I don’t have a strong view on whether Eliezer should add some more caveats in cases like this to ensure people are aware that he hasn’t demonstrated the memetic collapse thesis here, vs. expecting his readers to appropriately discount vivid anecdotes as a matter of course. I can see the appeal of both options.
I think the particular way you phrased your objection, in terms of “is this a locally valid inference?” rather than “is this likely to be emotionally appealing in a way that causes people to over-update?”, is wrong, though, and I think reflects an insufficiently bright line between personal-epistemics norms like “make good inferences” and social norms like “show your work”. I think you’re making overly strong symmetry claims here in ways that make for a cleaner narrative, and not seriously distinguishing “here’s a data point I’ll treat as strong supporting evidence for a claim where we should expect there to be a much stronger easy-to-communicate/compress argument if the claim is true” and “here’s a data point I’ll use to illustrate a claim where we shouldn’t expect there to be an easy-to-communicate/compress argument if the claim is true”. But it shouldn’t be necessary to push for symmetry here in any case; mistake seriousness is orthogonal to mistake irony.
I remain unconvinced by the arguments I’ve seen for the memetic collapse claim, and I’ve given some counterarguments to collapse claims in the past, but “I think you’re plausibly wrong” and “I haven’t seen enough evidence to find your view convincing” are pretty different from “I think you don’t have lots of unshared evidence for your belief” or “I think you’re making an easily demonstrated inference mistake”. I don’t think the latter two things are true, and I think it would take a lot of time and effort to actually resolve the disagreement.
(Also, I don’t mean to be glib or dismissive here about your ingroup bias worries; this was something I was already thinking about while I was composing my earlier comments, because there are lots of risk factors for motivated reasoning in this kind of discussion. I just want to be clear about what my beliefs and thinking are, factoring in bias risks as a big input.)
I think “give everyone an AGI” comes from this Medium piece that coincided with OpenAI’s launch:
Musk: [… W]e want AI to be widespread. There’s two schools of thought — do you want many AIs, or a small number of AIs? We think probably many is good. And to the degree that you can tie it to an extension of individual human will, that is also good. [...]
Altman: We think the best way AI can develop is if it’s about individual empowerment and making humans better, and made freely available to everyone, not a single entity that is a million times more powerful than any human. [...]
Couldn’t your stuff in OpenAI surpass human intelligence?
Altman: I expect that it will, but it will just be open source and useable by everyone instead of useable by, say, just Google. Anything the group develops will be available to everyone. If you take it and repurpose it you don’t have to share that. But any of the work that we do will be available to everyone. [...]
I want to return to the idea that by sharing AI, we might not suffer the worst of its negative consequences. Isn’t there a risk that by making it more available, you’ll be increasing the potential dangers?
Altman: I wish I could count the hours that I have spent with Elon debating this topic and with others as well and I am still not a hundred percent certain. You can never be a hundred percent certain, right? But play out the different scenarios. Security through secrecy on technology has just not worked very often. If only one person gets to have it, how do you decide if that should be Google or the U.S. government or the Chinese government or ISIS or who? There are lots of bad humans in the world and yet humanity has continued to thrive. However, what would happen if one of those humans were a billion times more powerful than another human?
Musk: I think the best defense against the misuse of AI is to empower as many people as possible to have AI. If everyone has AI powers, then there’s not any one person or a small set of individuals who can have AI superpower.
I don’t think I’ve ever seen actual OpenAI staff endorse strategies like that, though, and they’ve always said they consider openness itself conditional. E.g., Andrej Karpathy from a week or two later:
What if OpenAI comes up with a potentially game-changing algorithm that could lead to superintelligence? Wouldn’t a fully open ecosystem increase the risk of abusing the technology?
In a sense it’s kind of like CRISPR. CRISPR is a huge leap for genome editing that’s been around for only a few years, but has great potential for benefiting — and hurting — humankind. Because of these ethical issues there was a recent conference on it in DC to discuss how we should go forward with it as a society.
If something like that happens in AI during the course of OpenAI’s research — well, we’d have to talk about it. We are not obligated to share everything — in that sense the name of the company is a misnomer — but the spirit of the company is that we do by default.
And Greg Brockman from January 2016:
The one goal we consider immutable is our mission to advance digital intelligence in the way that is most likely to benefit humanity as a whole. Everything else is a tactic that helps us achieve that goal.
Today the best impact comes from being quite open: publishing, open-sourcing code, working with universities and with companies to deploy AI systems, etc. But even today, we could imagine some cases where positive impact comes at the expense of openness: for example, where an important collaboration requires us to produce proprietary code for a company. We’ll be willing to do these, though only as very rare exceptions and to effect exceptional benefit outside of that company.
In the future, it’s very hard to predict what might result in the most benefit for everyone. But we’ll constantly change our tactics to match whatever approaches seem most promising, and be open and transparent about any changes in approach (unless doing so seems itself unsafe!). So, we’ll prioritize safety given an irreconcilable conflict.
Not every conclusion is easy to conclusively demonstrate to arbitrary smart readers with a short amount of text, so I think it’s fine for people to share their beliefs without sharing all the evidence that got them there, and it’s good for others to flag the places where they disagree and will need to hear more. I think “… because I’ve read a couple of fictional stories that suggest so” is misunderstanding Eliezer’s reasoning / failing his ITT. (Though it’s possible what you meant is that he should be explicit about the fact that he’s relying on evidence/arguments that lack a detailed canonical write-up? That seems reasonable to me.)
I interpret agent foundations as being more about providing formal specifications of metaphilosophical competence, to [...] allow us to formally verify whether a computational process will satisfy desirable metaphilosophical properties
“Adding conceptual clarity” is a key motivation, but formal verification isn’t a key motivation.
The point of things like logical induction isn’t “we can use the logical induction criterion to verify that the system isn’t making reasoning errors”; as I understand it, it’s more “logical induction helps move us toward a better understanding of what good reasoning is, with a goal of ensuring developers aren’t flying blind when they’re actually building good reasoners”.
Daniel Dewey’s summary of the motivation behind HRAD is:
2) If we fundamentally “don’t know what we’re doing” because we don’t have a satisfying description of how an AI system should reason and make decisions, then we will probably make lots of mistakes in the design of an advanced AI system.
3) Even minor mistakes in an advanced AI system’s design are likely to cause catastrophic misalignment.
To which Nate replied at the time:
I think this is a decent summary of why we prioritize HRAD research. I would rephrase 3 as “There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I’d compare these mistakes to the “small” decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running.
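The null-terminated vs length-prefixed comparison can be made concrete with a toy simulation (in Python rather than C, with invented field names): a reader that trusts a terminator byte silently over-reads into adjacent data whenever the terminator goes missing, while a length-prefixed reader is bounded by construction.

```python
def read_null_terminated(buf, start):
    """Scan forward for a 0 byte, as C's string functions do.
    If the terminator has been clobbered, the read runs past the
    intended field boundary."""
    end = buf.index(0, start)  # raises ValueError if no 0 byte remains
    return buf[start:end]

def read_length_prefixed(buf, start):
    """First byte is the field length; the read can never exceed it."""
    n = buf[start]
    return buf[start + 1 : start + 1 + n]

record = b"user\x00secret\x00"
print(read_null_terminated(record, 0))   # → b'user'

clobbered = b"userXsecret\x00"           # one overwritten terminator...
print(read_null_terminated(clobbered, 0))  # → b'userXsecret' (adjacent field leaks)

prefixed = b"\x04userSECRET"
print(read_length_prefixed(prefixed, 0))   # → b'user' regardless of what follows
```

The point of the analogy: the design choice looks tiny at the time, but it bakes a failure mode into every downstream consumer of the format.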
The position of the AI community is something like the position researchers would be in if they wanted to build a space rocket, but hadn’t developed calculus or orbital mechanics yet. Maybe with enough trial and error (and explosives) you’ll eventually be able to get a payload off the planet that way, but if you want things to actually work correctly on the first go, you’ll need to do some basic research to cover core gaps in what you know.
To say that calculus or orbital mechanics help you “formally verify” that the system’s parts are going to work correctly is missing where the main benefit lies, which is in knowing what you’re doing at all, not in being able to machine-verify everything you’d like to. You need to formalize how good reasoning works because even if you can’t always apply conventional formal methods, you still need to understand what you’re building if you want robustness properties.
Eliezer wrote this in a private Facebook thread February 2017:
Reminder: Eliezer and Holden are both on record as saying that “steelmanning” people is bad and you should stop doing it.
As Holden says, if you’re trying to understand someone or you have any credence at all that they have a good argument, focus on passing their Ideological Turing Test. “Steelmanning” usually ends up as weakmanning by comparison. If they don’t in fact have a good argument, it’s falsehood to pretend they do. If you want to try to make a genuine effort to think up better arguments yourself because they might exist, don’t drag the other person into it.
And he FB-commented on Ozy’s Against Steelmanning in August 2016:
Be it clear: Steelmanning is not a tool of understanding and communication. The communication tool is the Ideological Turing Test. “Steelmanning” is what you do to avoid the equivalent of dismissing AGI after reading a media argument. It usually indicates that you think you’re talking to somebody as hapless as the media.
The exception to this rule is when you communicate, “Well, on my assumptions, the plausible thing that sounds most like this is...” which is a cooperative way of communicating to the person what your own assumptions are and what you think are the strong and weak points of what you think might be the argument.
Mostly, you should be trying to pass the Ideological Turing Test if speaking to someone you respect, and offering “My steelman might be...?” only to communicate your own premises and assumptions. Or maybe, if you actually believe the steelman, say, “I disagree with your reason for thinking X, but I’ll grant you X because I believe this other argument Y. Is that good enough to move on?” Be ready to accept “No, the exact argument for X is important to my later conclusions” as an answer.
“Let me try to imagine a smarter version of this stupid position” is when you’ve been exposed to the Deepak Chopra version of quantum mechanics, and you don’t know if it’s the real version, or what a smart person might really think is the issue. It’s what you do when you don’t want to be that easily manipulated sucker who can be pushed into believing X by a flawed argument for not-X that you can congratulate yourself for being skeptically smarter than. It’s not what you do in a respectful conversation.
I define “AI alignment” these days roughly the way the Open Philanthropy Project does:
the problem of creating AI systems that will reliably do what their users want them to do even when AI systems become much more capable than their users across a broad range of tasks
More specifically, I think of the alignment problem as “find a way to use AGI systems to do at least some ambitious, high-impact things, without inadvertently causing anything terrible to happen relative to the operator’s explicit and implicit preferences”.
This is an easier goal than “find a way to safely use AGI systems to do everything the operator could possibly want” or “find a way to use AGI systems to do everything everyone could possibly want, in a way that somehow ‘correctly’ aggregates preferences”; I sometimes see problem statements like those referred to as the “full” alignment problem.
It’s a harder goal than “find a way to get AGI systems to do roughly what the operators have in mind, without necessarily accounting for failure modes the operators didn’t think of”. Following the letter of the law rather than the spirit is only OK insofar as the difference between letter and spirit is non-catastrophic relative to the operators’ true implicit preferences.
If developers and operators can’t foresee every potential failure mode, alignment should still mean that the system fails gracefully. If developers make a moral error (relative to their own moral values) but get alignment right, alignment should mean that their moral error doesn’t automatically cause a catastrophe. This does and should mean that alignment is much harder if solutions are more fragile or local and failure modes are harder to foresee.
This way of thinking about the alignment problem seems more useful to me because it factors out questions related to value disagreements and coordination between humans (including Bostrom’s first principal-agent problem), but leaves “aligned” contentful enough that it does actually mean we’re keeping our eye on the ball. We’re not ignoring how catastrophic-accident-prone the system actually is just because the developer was being dumb.
(I guess you’d want a stronger definition if you thought it was realistic that AGI developers might earnestly in their heart-of-hearts just want to destroy the world, since that case does make the alignment problem too trivial.
I’m similarly assuming that there won’t be a deep and irreconcilable values disagreement among stakeholders about whether we should conservatively avoid high risk of mindcrime, though there may be factual disagreements aplenty, and perhaps there are irreconcilable casewise disagreements about where to draw certain normative category boundaries once you move past “just be conservative and leave a wide berth around anything remotely mindcrime-like” and start trying to implement “full alignment” that can spit out the normatively right answer to every important question.)