As I understand it, the initial Yudkowskian conception of Friendly AI research[1] was for a small, FAI-pilled, math- and science-inclined team to first figure out the Deep Math of reflective cognition (see the papers on Tiling Agents as an illustrative example: 1, 2). The point was to create a capability-augmenting recursive self-improvement procedure that would preserve the initial goals and values hardcoded into the model (evidence: Web Archive screenshot of the SingInst webpage circa 2006). See also this:
When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed “a brain in a box in a basement.” I love that phrase, so I stole it. In other words, we tend to visualize that there’s this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work in it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion.
Then you would figure out a way to encode human values into machine code directly, compute (a rough, imperfect approximation of) humanity’s CEV, and initialize a Seed AI with a ton of “hacky guardrails” (Eliezer’s own term) aimed at enacting it. Initially the AI would be pretty dumb, but:
we would know precisely what it was trying to do, because we would have hardcoded its desires directly.
we would know precisely how it would develop, because our Deep Mathematical Knowledge about agency and self-improvement would have yielded clear mathematical proofs that it would preserve its goals (and thus its Friendliness) as it self-improved (see the schematic condition after this list).
the hacky guardrails would ensure nothing broke at the beginning, and, as the model got better and its beliefs/actions/desires coherentized, the problems with the approximation of CEV would go away.
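To gesture at what that proof obligation was supposed to look like, here is a very rough schematic of the kind of condition studied in the Tiling Agents papers linked above (my own loose paraphrase, with informal notation). A parent agent $A_1$, reasoning in a formal theory $T_1$, only builds a successor $A_0$ that acts when it has a $T_0$-proof that its action secures the goal $G$:

$$A_0 \text{ performs } b \;\Longrightarrow\; \square_{T_0}\big[\, b \text{ is performed} \rightarrow G \,\big]$$

For $A_1$ to sign off on that successor, it has to trust $T_0$, i.e. $T_1 \vdash \square_{T_0}[\varphi] \rightarrow \varphi$ for the relevant $\varphi$. Löb's theorem prevents a consistent theory from proving its own soundness schema, so naively $T_0$ must be strictly weaker than $T_1$; this "Löbian obstacle" is what that line of work was trying to get around while keeping $G$ fixed across generations of self-improvement.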
So the point is that we might not know the internals of the final version of the FAI; it might be “inscrutable.” But that’s ok, they said, because we’d know with the certainty of mathematical proof that its goals are nonetheless good.
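To caricature the intended architecture in code (this is entirely my own illustration; every name below is a hypothetical placeholder, not anything MIRI published):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    description: str
    predicted_value: float   # value under the hardcoded utility proxy
    is_irreversible: bool
    resource_use: float


def cev_approximation(action: Action) -> float:
    """Rough, imperfect stand-in for humanity's CEV, hardcoded by the team."""
    return action.predicted_value


# The "hacky guardrails": crude vetoes meant to keep the early, dumb agent safe.
GUARDRAILS: List[Callable[[Action], bool]] = [
    lambda a: not a.is_irreversible,   # never do anything that can't be undone
    lambda a: a.resource_use < 1.0,    # don't grab lots of resources early on
]


class SeedAI:
    def __init__(self, utility: Callable[[Action], float]):
        self.utility = utility  # desires "hardcoded directly"

    def choose(self, candidates: List[Action]) -> Optional[Action]:
        # Guardrails filter first; the utility proxy picks among what survives.
        safe = [a for a in candidates if all(g(a) for g in GUARDRAILS)]
        return max(safe, key=self.utility, default=None)

    def self_improve(
        self,
        successor: "SeedAI",
        proves_goal_preservation: Callable[["SeedAI", "SeedAI"], bool],
    ) -> "SeedAI":
        # Only adopt a successor if the hoped-for Deep Math supplies a proof
        # that it pursues the same goals; otherwise keep the current version.
        return successor if proves_goal_preservation(self, successor) else self


if __name__ == "__main__":
    agent = SeedAI(cev_approximation)
    options = [
        Action("tile the planet with compute", 9.9, is_irreversible=True, resource_use=100.0),
        Action("write a polite research memo", 0.3, is_irreversible=False, resource_use=0.1),
    ]
    print(agent.choose(options).description)  # the guardrails veto the first option
```

The hope, as described above, was that the guardrails and the crude utility proxy only had to hold up early on, while the proof-checked self-improvement step carried the goals forward unchanged.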
From there on out, you relax, kick back, and plan the Singularity after-party.
Which will likely seem silly and wildly over-optimistic to observers in hindsight, and in my view should have seemed silly and wildly over-optimistic at the time too.
This was never going to work...
… without the help of an AI that is strong enough to significantly augment the proof research, which we have, or nearly have, now (it may still be a little ways out, but it is no longer inconceivable). This seems like very much not a dead end, and it is the sort of thing I'd expect even an AGI to think necessary in order to solve ASI alignment-to-that-AGI.
Exactly what to prove might end up looking a bit different, of course.
Why do you think it was never going to work? Even if you think humans aren’t smart enough, intelligence enhancement seems pretty likely.
MIRI lost the Mandate of Heaven smh
When and why?