As I understand it, the initial Yudkowskian conception of Friendly AI research[1] was for a small, FAI-pilled, math- and science-inclined team to first figure out the Deep Math of reflective cognition (see the papers on Tiling Agents as an illustrative example: 1, 2). The point was to create a capability-augmenting recursive self-improvement procedure that would preserve the initial goals and values hardcoded into the model (evidence: Web Archive screenshot of the SingInst webpage circa 2006). See also this:
When we try to visualize how all this is likely to go down, we tend to visualize a scenario that someone else once termed “a brain in a box in a basement.” I love that phrase, so I stole it. In other words, we tend to visualize that there’s this AI programming team, a lot like the sort of wannabe AI programming teams you see nowadays, trying to create artificial general intelligence, like the artificial general intelligence projects you see nowadays. They manage to acquire some new deep insights which, combined with published insights in the general scientific community, let them go down into their basement and work in it for a while and create an AI which is smart enough to reprogram itself, and then you get an intelligence explosion.
Then you would figure out a way to encode human values into machine code directly, compute (a rough, imperfect approximation of) humanity’s CEV, and initialize a Seed AI with a ton of “hacky guardrails” (Eliezer’s own term) aimed at enacting it. Initially the AI would be pretty dumb, but:
we would know precisely what it was trying to do, because we would have hardcoded its desires directly.
we would know precisely how it would develop, because our Deep Mathematical Knowledge about agency and self-improvement would have yielded clear mathematical proofs that it would preserve its goals (and thus its Friendliness) as it self-improved (see the schematic condition after this list).
the hacky guardrails would ensure nothing broke at the beginning, and, as the model got better and its beliefs/actions/desires coherentized, the problems with the approximation of CEV would go away.
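To gesture at what that proof obligation was supposed to look like, here is a very rough schematic of the kind of condition studied in the Tiling Agents papers linked above (my own loose paraphrase, with informal notation). A parent agent $A_1$, reasoning in a formal theory $T_1$, only builds a successor $A_0$ that acts when it has a $T_0$-proof that its action secures the goal $G$:

$$A_0 \text{ performs } b \;\Longrightarrow\; \square_{T_0}\big[\, b \text{ is performed} \rightarrow G \,\big]$$

For $A_1$ to sign off on that successor, it has to trust $T_0$, i.e. $T_1 \vdash \square_{T_0}[\varphi] \rightarrow \varphi$ for the relevant $\varphi$. Löb's theorem prevents a consistent theory from proving its own soundness schema, so naively $T_0$ must be strictly weaker than $T_1$; this "Löbian obstacle" is what that line of work was trying to get around while keeping $G$ fixed across generations of self-improvement.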
So the point is that we might not know the internals of the final version of the FAI; it might be “inscrutable.” But that’s ok, they said, because we’d know with the certainty of mathematical proof that its goals are nonetheless good.
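To caricature the intended architecture in code (this is entirely my own illustration; every name below is a hypothetical placeholder, not anything MIRI published):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    description: str
    predicted_value: float   # value under the hardcoded utility proxy
    is_irreversible: bool
    resource_use: float


def cev_approximation(action: Action) -> float:
    """Rough, imperfect stand-in for humanity's CEV, hardcoded by the team."""
    return action.predicted_value


# The "hacky guardrails": crude vetoes meant to keep the early, dumb agent safe.
GUARDRAILS: List[Callable[[Action], bool]] = [
    lambda a: not a.is_irreversible,   # never do anything that can't be undone
    lambda a: a.resource_use < 1.0,    # don't grab lots of resources early on
]


class SeedAI:
    def __init__(self, utility: Callable[[Action], float]):
        self.utility = utility  # desires "hardcoded directly"

    def choose(self, candidates: List[Action]) -> Optional[Action]:
        # Guardrails filter first; the utility proxy picks among what survives.
        safe = [a for a in candidates if all(g(a) for g in GUARDRAILS)]
        return max(safe, key=self.utility, default=None)

    def self_improve(
        self,
        successor: "SeedAI",
        proves_goal_preservation: Callable[["SeedAI", "SeedAI"], bool],
    ) -> "SeedAI":
        # Only adopt a successor if the hoped-for Deep Math supplies a proof
        # that it pursues the same goals; otherwise keep the current version.
        return successor if proves_goal_preservation(self, successor) else self


if __name__ == "__main__":
    agent = SeedAI(cev_approximation)
    options = [
        Action("tile the planet with compute", 9.9, is_irreversible=True, resource_use=100.0),
        Action("write a polite research memo", 0.3, is_irreversible=False, resource_use=0.1),
    ]
    print(agent.choose(options).description)  # the guardrails veto the first option
```

The hope, as described above, was that the guardrails and the crude utility proxy only had to hold up early on, while the proof-checked self-improvement step carried the goals forward unchanged.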
From there on out, you relax, kick back, and plan the Singularity after-party.
Which will likely seem silly and wildly over-optimistic to observers in hindsight, and in my view should have seemed silly and wildly over-optimistic at the time too.
This was never going to work...
… without the help of an AI that is strong enough to significantly augment the proof research, which we have, or nearly have, now (it may still be a little ways out, but it is no longer inconceivable). This seems like very much not a dead end, and it is the sort of thing I'd expect even an AGI to think necessary in order to solve ASI alignment-to-that-AGI.
Exactly what to prove might end up looking a bit different, of course.
Why do you think it was never going to work? Even if you think humans aren’t smart enough, intelligence enhancement seems pretty likely.
MIRI lost the Mandate of Heaven smh
When and why?