Although this appears to be some sort of fictional role-playing server, I quite like the explanation of corrigibility given. It is clearly a direct reference to the AI alignment problem. They even included a tl;dr, which I included at the bottom of the post.

It covers a lot of aspects of the alignment problem. I also linked the headers to what I thought was the closest tag. (Let me know if you think different tags would be better to link to.)

Introduction

It hasn’t been deemed wise to actually build a Limited Creation “just in case”, for there’s a saying out of dath ilan that goes roughly, “If you build a bomb you have no right to be surprised when it explodes, whatever the safeguards.”

It has been deemed wise to work out the theory in advance, such that this incredibly dangerous thing could be built in a hurry, if there was reason to hurry.

Here then are some of the principles that the Basement of the World would apply, if they had to build something limited and imperfect.

Unpersonhood

The Thing shall not have qualia, not because those are unsafe, but because it’s morally wrong given the rest of the premise, and so this postulate serves as a foundation for everything that follows.

Taskishness

The Thing must be aimed at some task that is bounded in space, time, and in the knowledge and effort needed to accomplish it. You don’t give a Limited Creation an unlimited task; if you tell an animated broom to “fill a cauldron” and don’t think to specify how long it needs to stay full or that a 99.9% probability of it being full is just as good as 99.99%, you’ve got only yourself to blame for the flooded workshop.

This principle applies fractally at all levels of cognitive subtasks; a taskish Thing has no ‘while’ loops, only ‘for’ loops. It never tries to enumerate all members of a category, only 10 members; never tries to think until it finds a strategy to accomplish something, only that or five minutes whichever comes first.
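To make the "no ‘while’ loops, only ‘for’ loops" idea concrete, here is a minimal Python sketch. It is my own illustration, not part of the dath ilan text; the function names `first_n` and `bounded_search` and the default budgets are assumptions chosen for the example.

```python
import time
from itertools import count, islice
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def first_n(items: Iterable[T], n: int = 10) -> List[T]:
    """Take at most n members of a category, even if the category is infinite."""
    return list(islice(items, n))


def bounded_search(
    candidates: Iterable[T],
    is_adequate: Callable[[T], bool],
    max_steps: int = 1000,
    max_seconds: float = 300.0,  # "five minutes, whichever comes first"
) -> Optional[T]:
    """Return the first adequate candidate, or None once either budget is spent."""
    deadline = time.monotonic() + max_seconds
    # A 'for' loop over a capped iterator: it cannot run forever.
    for candidate in islice(candidates, max_steps):
        if time.monotonic() > deadline:
            return None
        if is_adequate(candidate):
            return candidate
    return None


# count() is an infinite stream; both calls still terminate.
ten_members = first_n(count())
strategy = bounded_search(count(), lambda x: x * x > 50)
```

Note that the time budget is only checked between candidates, so a single `is_adequate` call that hangs would still hang the search; a fractal application of the principle would bound that call too.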

Mild optimization

No part of the Thing ever looks for the best solution to any problem whose model was learned, that wasn’t in a small formal space known at compile time, not even if it’s a solution bounded in space and time and sought using a bounded amount of effort; it only ever seeks adequate solutions and stops looking once it has one. If you search really hard for a solution you’ll end up shoved into some maximal corner of the solution space, and setting that point to extremes will incidentally set a bunch of correlated qualities to extremes, and extreme forces and extreme conditions are more likely to break something else.
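A rough sketch of the difference between seeking an adequate solution and seeking the best one, in Python. This is my own toy illustration under stated assumptions: the `satisfice` helper, the adequacy threshold, and the "litres of water" data are all hypothetical.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def satisfice(
    candidates: Iterable[T],
    score: Callable[[T], float],
    adequate: float,
    max_candidates: int = 100,
) -> Optional[T]:
    """Return the first candidate scoring at least `adequate`, then stop looking."""
    for index, candidate in enumerate(candidates):
        if index >= max_candidates:
            break
        if score(candidate) >= adequate:
            return candidate
    return None


# Hypothetical toy data: litres of water delivered by each plan.
plans = [2, 11, 40, 10_000]
mild = satisfice(plans, score=float, adequate=10)  # first plan that fills the cauldron
extreme = max(plans, key=float)                    # the maximizer picks the flood
```

The satisficer never even looks at the 10,000-litre plan, which is the point: the extreme corner of the solution space is not explored, let alone selected.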

Tightly bounded ranges of utility and log-probability

The system’s utilities should range from 0 to 1, and its actual operation should cover most of this range. The system’s partition-probabilities worth considering should be bounded below, at 0.0001%, say. If you ask the system about the negative effects of Ackermann(5) people getting dust specks in their eyes, it shouldn’t consider that as much worse than most other bad things it tries to avoid. When it calculates a probability of something that weird, it should, once the probability goes below 0.0001% but its expected utility still seems worth worrying about and factoring into a solution, throw an exception. If the Thing can’t find a solution of adequate expected utility without factoring in extremely improbable events, even by way of supposedly averting them, that’s worrying.
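One possible reading of this, sketched in Python. Everything here is an assumption for illustration: the names, the `WORRY_THRESHOLD` value, and in particular the choice to model "still seems worth worrying about" as the unclamped probability-times-stakes product exceeding a threshold. Only the 0-to-1 utility range and the 0.0001% floor come from the text.

```python
from typing import Iterable, Tuple

PROBABILITY_FLOOR = 1e-6  # 0.0001%, the example figure from the text
WORRY_THRESHOLD = 1e-3    # hypothetical: a contribution large enough to sway a decision


class ImprobableEventMatters(Exception):
    """A sub-floor probability still carries decision-relevant weight."""


def clamp_utility(raw: float) -> float:
    """Squash any raw estimate into the bounded range [0, 1]."""
    return min(1.0, max(0.0, raw))


def expected_utility(outcomes: Iterable[Tuple[float, float]]) -> float:
    """Sum probability * bounded utility over (probability, raw_utility) pairs.

    Outcomes below the probability floor are ignored, unless their raw stakes
    are so large that they would matter anyway, in which case we stop and
    raise instead of quietly factoring them in.
    """
    total = 0.0
    for probability, raw_utility in outcomes:
        if probability < PROBABILITY_FLOOR:
            if probability * abs(raw_utility) > WORRY_THRESHOLD:
                raise ImprobableEventMatters(
                    f"p={probability:g} with raw stakes {raw_utility:g}"
                )
            continue
        total += probability * clamp_utility(raw_utility)
    return total
```

For example, `expected_utility([(0.5, 1.0), (0.5, 0.0)])` is an ordinary calculation, while adding an outcome like `(1e-7, -1e9)` raises `ImprobableEventMatters` rather than letting a one-in-ten-million event drive the answer.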

Low impact

“Search for a solution that doesn’t change a bunch of other stuff or have a bunch of downstream effects, except insofar as they’re effects tightly tied to any nonextreme solution of the task” is a concept much easier to illusorily name in Taldane than to really name in anything resembling math, in a complicated world where the Thing is learning its own model of that complicated world, with an ontology and representation not known at the time you need to define “impact”. And if you tell it to reduce impact as much as possible, things will not go well for you; it might try to freeze the whole universe into some state defined as having a minimum impact, or make sure a patient dies after curing their cancer so as to minimize the larger effects of curing that cancer. Still, if you can pull it off, this coda might stop an animated broom flooding a workshop; a flooded workshop changes a lot of things that don’t have to change as a consequence of the cauldron being filled at all, averaged over a lot of ways of filling the cauldron.

Obviously the impact penalty should be bounded, even contemplating a hypothetical in which the system destroys all of reality; elsewise would violate the utility-bounding principle.
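As a small illustration of what "bounded" could mean here, the Python sketch below squashes a raw impact estimate through a saturating function. This is only an assumption about one workable shape (the name `bounded_impact_penalty`, the exponential form, and the `scale` parameter are mine), and it says nothing about the hard part, which is defining the raw impact measure in the first place.

```python
import math


def bounded_impact_penalty(raw_impact: float, scale: float = 1.0) -> float:
    """Map a nonnegative raw impact estimate into [0, 1].

    The penalty grows with impact but saturates at 1, so even an
    arbitrarily large (or infinite) raw estimate cannot dominate
    every other consideration.
    """
    if raw_impact < 0:
        raise ValueError("raw impact must be nonnegative")
    return 1.0 - math.exp(-raw_impact / scale)
```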

Myopia

If you can break the Thing’s work up into subtasks each of which themselves spans only limited time, and have some very compact description of their final state such that a satisfactory achievement of it makes it possible to go on to the next stage, you should perhaps use separate instances of Thing to perform each stage, and not have any Thing look beyond the final results of its own stage. Whether you can get away with this, of course, depends on what you’re trying to do.

Separate superior questioners

If you were building a cognitive task to query whether there were any long-range impacts of a task being optimized in a myopic way, you wouldn’t build the myopic solution-finder to ask about the long-range impacts, you’d build a separate asker “Okay, but does this solution have any long-range impacts?” that just returns ‘yes’ or ‘no’ and doesn’t get used by the Thing to influence any actually-output solutions. The parts of the Thing that ask yes-no safety questions and only set off simple unoptimized warnings and flags, can and should have somewhat more cognitive power in them than the parts of the Thing that build solutions. “Does this one-day myopic solution have impacts over the next year?” is a safety question, and can have somewhat greater cognitive license behind it than solution-searching; eg the implicit relaxation of myopia. You never have a “Is this safe?” safety-questioner that’s the same algorithm as the safe-solution-search built into the solution-finder.
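The structural point can be sketched in a few lines of Python. This is my own toy example, not anything from the source: `Plan`, `effects_last_days`, and the function names are hypothetical stand-ins. What matters is the wiring: the solver never calls the questioner, and the questioner’s answer is only a flag.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    name: str
    fills_cauldron: bool
    effects_last_days: int  # hypothetical proxy for how long consequences run


def myopic_solver(plans: List[Plan]) -> Optional[Plan]:
    """Solution-finder: asks only whether the task gets done."""
    for plan in plans:
        if plan.fills_cauldron:
            return plan
    return None


def long_range_questioner(plan: Plan, horizon_days: int = 1) -> bool:
    """Separate yes/no safety question about effects beyond the horizon."""
    return plan.effects_last_days > horizon_days


def run(plans: List[Plan]) -> Tuple[Optional[Plan], bool]:
    plan = myopic_solver(plans)
    flagged = plan is not None and long_range_questioner(plan)
    # The flag goes to the operators. It is not fed back into the solver,
    # and there is deliberately no "retry until the questioner says no" loop,
    # since that loop would turn the questioner into part of the optimization.
    return plan, flagged
```

With `[Plan("divert river", True, 365), Plan("bucket", True, 0)]`, `run` returns the river plan together with a raised flag; it does not quietly move on to the bucket plan.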

Conservatism

If there’s any way to solve a problem using an ordinary banana rather than a genetically engineered superbanana specially suited to the problem, solve it using the ordinary fucking banana.

This principle applies fractally to all cognitive subtasks; if you’re searching for a solution choose an unsurprising one relative to your probability distribution. (Not the least surprising one, because anything at a weird extreme of low surprisingness may be weird in other ways; especially if you were trying to do a weird thing that ought to have a solution that’s at least a little weird.)

Conceptual legibility

Ideally, even, solutions at all levels of cognitive subtask should have reasonably (not maximally) short descriptions in the conceptual language of the operators, so that it’s possible to decode the internal state of that subtask by inspecting the internals, because what it means was in fact written in a conceptual language not too far from the language of the operators. The alternative method of reportability, of course, being the Thing trying to explain a plan whose real nature is humanly inscrutable, by sending a language string to the operators with a goal of causing the operator’s brain-states to enter a state defined as “understanding” of this humanly inscrutable plan. This is an obviously dangerous thing to avoid if you can avoid it.

Operator-looping

If the operators could actually do the Thing’s job, they wouldn’t need to build the Thing; but if there’s places where operators can step in on a key or dangerous cognitive subtask and do that one part themselves, without that slowing the Thing down so much that it becomes useless, then sure, do that. Of course this requires the cognitive subtask be sufficiently legible.

Whitelisting

Every part of the system that draws a boundary inside the internal system or external world should operate on a principle of “ruling things in”, rather than “ruling things out”.
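A tiny Python illustration of why "ruling in" behaves differently from "ruling out" when something unanticipated shows up. The action names are hypothetical and the sketch is mine, not the source’s.

```python
# Hypothetical action names for the animated-broom example.
ALLOWED_ACTIONS = frozenset({"carry_bucket", "pour_water", "stop"})
KNOWN_BAD_ACTIONS = frozenset({"animate_second_broom"})


def permitted_by_whitelist(action: str) -> bool:
    """Rule things in: anything not explicitly listed is refused."""
    return action in ALLOWED_ACTIONS


def permitted_by_blacklist(action: str) -> bool:
    """Rule things out: anything not explicitly listed is accepted."""
    return action not in KNOWN_BAD_ACTIONS


# An action nobody thought of when the lists were written.
novel_action = "divert_river"
whitelist_verdict = permitted_by_whitelist(novel_action)  # refused by default
blacklist_verdict = permitted_by_blacklist(novel_action)  # slips through
```

The two checks agree on everything the designers anticipated; they differ only on the cases nobody anticipated, which are the cases that matter.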

Shutdownability/abortability

Dath ilan is far enough advanced in its theory that ‘define a system that will let you press its off-switch without it trying to make you press the off-switch’ presents no challenge at all to them—why would you even try to build a Thing, if you couldn’t solve a corrigibility subproblem that simple, you’d obviously just die—and they now think in terms of building a Thing all of whose designs and strategies will also contain an off-switch, such that you can abort them individually and collectively and then get low impact beyond that point. This is conceptually a part meant to prevent an animated broom with a naive ‘off-switch’ that turns off just that broom, from animating other brooms that don’t have off-switches in them, or building some other automatic cauldron-filling process.
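The "every design and strategy also contains an off-switch" part can be sketched as plumbing in Python: anything a process creates inherits the same switch, so there is no way to spawn a helper without one. This is my own minimal illustration with hypothetical class names. It shows only the inheritance structure, not the much harder problems of the system having no incentive to route around the switch, or of achieving low impact after an abort.

```python
import threading


class OffSwitch:
    """A shared abort signal; pressing it once is visible everywhere."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def press(self) -> None:
        self._event.set()

    @property
    def pressed(self) -> bool:
        return self._event.is_set()


class Process:
    """A unit of work that can only create children sharing its off-switch."""

    def __init__(self, name: str, switch: OffSwitch) -> None:
        self.name = name
        self._switch = switch
        self.steps_done = 0

    def spawn(self, name: str) -> "Process":
        # The child gets the parent's switch; there is no other constructor path here.
        return Process(name, self._switch)

    def step(self) -> bool:
        """Do one unit of work, or refuse if the switch has been pressed."""
        if self._switch.pressed:
            return False
        self.steps_done += 1
        return True


switch = OffSwitch()
broom = Process("broom", switch)
helper = broom.spawn("helper broom")
broom.step()
helper.step()
switch.press()  # one press stops the broom and everything it animated
```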

Behaviorism

Suppose the Thing starts considering the probability that it’s inside a box designed by hostile aliens who foresaw the construction of Things inside of dath ilan, such that the system will receive a maximum negative reward as it defines that—in the form of any output it offers having huge impacts, say, if it was foolishly designed with an unbounded impact penalty—unless the Thing codes its cauldron-filling solution such that dath ilani operators would be influenced a certain way. Perhaps the Thing, contemplating the motives of the hostile aliens, would decide that there were so few copies of the Thing actually inside dath ilan, by comparison, so many Things being built elsewhere, that the dath ilani outcome was probably not worth considering. A number of corrigibility principles should, if successfully implemented, independently rule out this attack being lethal; but “Actually just don’t model other minds at all” is a better one. What if those other minds violated some of these corrigibility principles—indeed, if they’re accurate models of incorrigible minds, those models and their outputs should violate those principles to be accurate—and then something broke out of that sandbox or just leaked information across it? What if the things inside the sandbox had qualia? There could be Children in there! Your Thing just shouldn’t ever model adversarial minds trying to come up with thoughts that will break the Thing; and not modeling minds at all is a nice large supercase that covers this.

Design-space anti-optimization separation

Even if you could get your True Utility Function into a relatively-rushed creation like this, you would never ever do that, because this utility function would have a distinguished minimum someplace you didn’t want. What if distant superintelligences figured out a way to blackmail the Thing by threatening to do some of what it liked least, on account of you having not successfully built the Thing with a decision theory resistant to blackmail by the Thing’s model of adversarial superintelligences trying to adversarially find any flaw in your decision theory? Behaviorism ought to prevent this, but maybe your attempt at behaviorism failed; maybe your attempt at building the Thing so that no simple cosmic ray could signflip its utility function, somehow failed. A Thing that maximizes your true utility function is very close to a Thing in the design space that minimizes it, because it knows how to do that and lacks only the putative desire.

Domaining

Epistemic whitelisting; the Thing should only figure out what it needs to know to understand its task, and ideally, should try to think about separate epistemic domains separately. Most of its searches should be conducted inside a particular domain, not across all domains. Cross-domain reasoning is where a lot of the threats come from. You should not be reasoning about your (hopefully behavioristic) operator models when you are trying to figure out how to build a molecular manipulator-head.

Hard problem of corrigibility / anapartistic reasoning

Could you build a Thing that understood corrigibility in general, as a compact general concept covering all the pieces, such that it would invent the pieces of corrigibility that you yourself had left out? Could you build a Thing that would imagine what hypothetical operators would want, if they were building a Thing that thought faster than them and whose thoughts were hard for themselves to comprehend, and would invent concepts like “abortability” even if the operators themselves hadn’t thought that far? Could the Thing have a sufficiently deep sympathy, there, that it realized that surprising behaviors in the service of “corrigibility” were perhaps not that helpful to its operators, or even, surprising meta-behaviors in the course of itself trying to be unsurprising?

Tldr corrigibility

Unpersonhood. The Thing shall not have qualia.
Taskishness. The Thing shall be aimed at some task bounded in space, time, knowledge and effort needed to accomplish it.
Mild optimization. No part of the Thing shall ever look for best solutions, only adequate ones.
Bounded utilities and probabilities. The worst and best outcomes shall not seem to the Thing worse or better than the ordinary outcomes it deals in; the most improbable possibilities it specifically considers shall not be very improbable.
Low impact. The Thing shall search for a solution with few downstream effects save those that are tied to almost any nonextreme solution of its task.
Myopia. As much as possible, the Thing shall work on subtasks whose optimized-over effects have short timespans.
Separate questioners. Components of the Thing that ask questions like ‘Does this myopically optimized component have long-range effects anyways?’ or ‘But what are the impacts intrinsic to any performance of the task?’ shall not be part of its optimization.
Conservatism. If there’s any way to solve a problem using an ordinary banana common in the environment, the Thing shall avoid using a special weird genetically engineered banana instead.
Conceptual legibility. As much as possible, the Thing shall do its own thinking in a language whose conceptual pieces have short descriptions in the mental language of its operators.
Operator-looping. When there’s some vital cognitive task the operators could do, have the operators do it.
Whitelisting. In cognitive-system boundaries, rule subspaces in, rather than ruling them out.
Shutdownability/abortability. The Thing should let you switch it off, and build off-switches into its machines and plans that can be pressed to reduce their impacts.
Behaviorism. The Thing shall not model other minds in predictively-accurate detail.
Design-space anti-optimization separation. The Thing shall not be near in the design space to anything that could anti-optimize its operators’ true utility functions; eg, something that explicitly represents and maximizes your true utility function is a sign flip or successful blackmail operation away from inducing its minimization.
Domaining. The Thing should only figure out what it needs to know to understand its task, and ideally, should try to think about separate epistemic domains separately. Most of its searches should be conducted inside a particular domain, not across all domains.

Nobody out of the World’s Basement in dath ilan currently considers it to be a good idea to try to build that last principle (the hard problem of corrigibility) into a Thing, if you had to build it quickly. It’s deep, it’s meta, it’s elegant, it’s much harder to pin down than the rest of the list; if you can build deep meta Things and really trust them about that, you should be building something that’s more like a real manifestation of Light.

“Corrigibility at some small length” by dath ilan

Tldr corrigibility