We don’t make an AI which tries to be harmless with respect to its side channels, such as hardware attacks. Except for its output, it needs to be strongly boxed, such that it can’t destroy our world by exploiting software or hardware vulnerabilities. Similarly, we don’t make an AI which tries to output a solution we like; it tries to output a solution which the math would score highly. Narrowing what we want the AI to do greatly helps us build the right thing, but it does add constraints to our work.
Another potential downside of this approach: it places a lot of constraints on the AI itself, which means it probably has to be strongly superintelligent to start working at all.
I think an important desideratum of any alignment plan is that your AI system starts working gradually, with a “capabilities dial” that you (and the aligned system itself) turn up just enough to save the world, and not more.
Intuitively, I feel like an aligned AGI should look kind of like a friendly superhero, whose superpowers are weak superintelligence, superhuman ethics, and a morality which is as close as possible to the coherence-weighted + extrapolated average morality of all currently existing humans (probably not literally; I’m just trying to gesture at a general thing of averaging over collective extrapolated volition / morality / etc.).
Brought into existence, that superhero would then consider two broad classes of strategies:
1. Solve a bunch of hard alignment problems (embedded agency, stable self-improvement, etc.), and then, having solved those, build a successor system to do the actual work.
2. Directly do some things with biotech / nanotech / computer security / etc. at its current intelligence level to end the acute risk period. Solve remaining problems at its leisure, or just leave them to the humans.
From my own not-even-weakly superhuman vantage point, (2) seems like a much easier and less fraught strategy than (1). If I were a bit smarter, I’d try saving the world without AI or enhancing myself any further than I absolutely needed to.
Faced with the problem that the boxed AI in the QACI scheme is facing… :shrug:. I guess I’d try some self-enhancement followed by solving problems in (1), and then try writing code for a system that does (2) reliably. But it feels like I’d need to be a LOT smarter to even begin making progress.
Provably safely building the first “friendly superhero” might require solving some hard math and philosophical problems, for which QACI might be relevant or at least in the right general neighborhood. But that doesn’t mean that the resulting system itself should be doing hard math or exotic philosophy. Here, I think the intuition of more optimistic AI researchers is actually right: an aligned human-ish level AI looks closer to something that is just really friendly and nice and helpful, and also super-smart.
(I haven’t seen any plans for building such a system that don’t seem totally doomed, but the goal itself still seems much less fraught than targeting strong superintelligence on the first try.)