The more recent Safeguarded AI document has some parts that seem to me to go against the interpretation I had, which was along the lines of this post.
Namely, that davidad’s proposal was not “CEV full alignment on AI that can be safely scaled without limit” but rather “sufficient control of AI that is only barely more powerful than what is needed for ethical global non-proliferation”.
In other words:
A) “this doesn’t guarantee a positive future but buys us time to solve alignment”
B) “a sufficiently powerful superintelligence would blow right through these constraints, but they hold at the power level we think is enough for A”, thus implying “we also need boundedness somehow”.
The Safeguarded AI document says this, though:
and that this milestone could be achieved, thereby making it safe to unleash the full potential of superhuman AI agents, within a time frame that is short enough (<15 years) [bold mine]
and
and with enough economic dividends along the way (>5% of unconstrained AI’s potential value) [bold mine][1]
I’m probably missing something, but that seems to imply a claim that the control approach would be resilient against arbitrarily powerful misaligned AI?
A related thing I’m confused about is the part that says:
one eventual application of these safety-critical assemblages is defending humanity against potential future rogue AIs [bold mine]
Whereas I previously thought that the point of the proposal was to create AI powerful-enough and controlled-enough to ethically establish global non-proliferation (so that “potential future rogue AIs” wouldn’t exist in the first place), it now seems to go in the direction of Good(-enough) AI defending against potential Bad AI?
[1] The “unconstrained AI” in this sentence seems to be about how much value would be achieved from adoption of the safe/constrained design versus the counterfactual value of mainstream/unconstrained AI. My mistake.
The “constrained” still seems to refer to whether there’s a “box” around the AI, with all outputs funneled through formal verification checks on their predicted consequences. It does not seem to refer to a constraint on the “power level” (“boundedness”) of the AI within the box.
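To make that concrete for myself, here is a minimal sketch of how I picture that “box” pattern, i.e. outputs only leave the box after a check on their predicted consequences passes. All names and the example spec are my own toy placeholders, not anything from the actual Safeguarded AI design:

```python
# Toy "box": the AI's proposed outputs are only released if a check on their
# predicted consequences passes. Placeholder names, not the real design.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProposedOutput:
    action: str            # what the boxed AI wants to do in the world
    predicted_state: dict  # the world-model's prediction of the consequences


def gatekeeper(proposal: ProposedOutput,
               spec_check: Callable[[dict], bool]) -> Optional[str]:
    """Release the action only if its predicted consequences satisfy the spec.

    Note: nothing here bounds how capable the AI inside the box is; the
    "constraint" is purely on which outputs make it out.
    """
    if spec_check(proposal.predicted_state):
        return proposal.action  # passes verification -> leaves the box
    return None                 # fails verification -> never acted on


# Example spec: predicted harm must stay under some threshold.
def example_spec(predicted_state: dict) -> bool:
    return predicted_state.get("harm", 0.0) <= 0.01


print(gatekeeper(ProposedOutput("deploy plan A", {"harm": 0.0}), example_spec))  # deploy plan A
print(gatekeeper(ProposedOutput("deploy plan B", {"harm": 0.5}), example_spec))  # None
```

Writing it out this way is what makes me read the “constrained” in [1] as being about the gate, not about the power level of whatever sits behind it.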
I can see how advancing those areas would empower membranes to be better at self-defense.
I’m having a hard time visualizing how explicitly adding a concept, formalism, or implementation of membranes/boundaries would help advance those areas (and in turn help empower membranes more).
For example, is “what if we add membranes to loom” a question that typechecks? What would “add membranes” reify as in a case like that?
In the other direction, would there be a way to model a system’s membrane (stretch goal: a human child’s; MVP: a bargaining bot’s?) quantitatively somehow, in a way where you can compare before/after across different interventions and estimate how well each does at empowering/protecting the membrane? Would it have a way of distinguishing the amount of protection added from outside vs from inside? Does “what if we add loom to membranes” compile?
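To gesture at what I mean by “compare before/after”, here is a toy where the “membrane score” is something I made up on the spot (how many unauthorized boundary crossings get through, and how much of the blocking came from inside vs outside); none of this is an existing formalism:

```python
# Toy "membrane score": fraction of unauthorized boundary crossings that get
# through, plus how much of the blocking came from inside vs outside.
# Everything here (names, metric, simulation) is invented for illustration.
import random
from dataclasses import dataclass


@dataclass
class CrossingAttempt:
    authorized: bool  # did the system consent to this boundary crossing?
    blocked_by: str   # "inside" (own defenses), "outside" (environment), or "none"


def membrane_score(attempts: list[CrossingAttempt]) -> dict:
    unauthorized = [a for a in attempts if not a.authorized]
    if not unauthorized:
        return {"violations": 0.0, "inside_share": None, "outside_share": None}
    blocked = [a for a in unauthorized if a.blocked_by != "none"]
    return {
        # unauthorized crossings that got through anyway
        "violations": 1 - len(blocked) / len(unauthorized),
        # of the blocked ones, how much protection came from inside vs outside
        "inside_share": sum(a.blocked_by == "inside" for a in blocked) / max(len(blocked), 1),
        "outside_share": sum(a.blocked_by == "outside" for a in blocked) / max(len(blocked), 1),
    }


def simulate(p_block_inside: float, p_block_outside: float, n: int = 10_000) -> list[CrossingAttempt]:
    attempts = []
    for _ in range(n):
        authorized = random.random() < 0.5
        if random.random() < p_block_inside:
            blocked_by = "inside"
        elif random.random() < p_block_outside:
            blocked_by = "outside"
        else:
            blocked_by = "none"
        attempts.append(CrossingAttempt(authorized, blocked_by))
    return attempts


random.seed(0)
before = membrane_score(simulate(p_block_inside=0.2, p_block_outside=0.3))
after = membrane_score(simulate(p_block_inside=0.2, p_block_outside=0.7))  # intervention: more outside protection
print("before:", before)
print("after: ", after)
```

Even a toy this crude would at least let you ask whether a given intervention mostly added outside protection or inside protection.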