Maybe a very practical question about the diagram: is there a REASON for there to be no “sufficient together” linkage from “Intent Alignment” and “Robustness” up to “Behavioral Alignment”?
Leaning hard on my technical definitions:
Robustness: Performing well on the base objective in a wide range of circumstances.
Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don’t want to get into exactly what “alignment” means.)
These two together do not quite imply behavioral alignment, because it’s possible for a model to have a human-friendly mesa-objective but be super bad at achieving it, while being super good at achieving some other objective.
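(To pin the gap down in toy form, here is a minimal sketch, with entirely made-up names and thresholds, of a model that satisfies both premises and still fails behavioral alignment:)

```python
# A toy counterexample (all names/thresholds hypothetical). "Robustness" only
# constrains performance on the BASE objective; "Intent Alignment" only
# constrains what the MESA objective IS. Neither constrains performance ON
# the mesa-objective, so behavior can still come apart from intent.
from dataclasses import dataclass

@dataclass
class ToyModel:
    mesa_objective: str            # what the model "wants"
    performance: dict[str, float]  # how well it actually does, per objective

HUMAN_VALUES = "human_values"
BASE = "base_objective"

def robust(m: ToyModel) -> bool:
    return m.performance[BASE] > 0.9          # performs well on the base objective

def intent_aligned(m: ToyModel) -> bool:
    return m.mesa_objective == HUMAN_VALUES   # wants the right thing

def behaviorally_aligned(m: ToyModel) -> bool:
    return m.performance[HUMAN_VALUES] > 0.9  # actually DOES the right thing

# Wants the right thing, is bad at it, great at something else entirely:
m = ToyModel(mesa_objective=HUMAN_VALUES,
             performance={BASE: 0.99, HUMAN_VALUES: 0.1})
assert robust(m) and intent_aligned(m) and not behaviorally_aligned(m)
```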
So, yes, there is a little bit of gear-grinding if we try to combine the two plans like that. They aren’t quite the right thing to fit together.
It’s like we have a magic vending machine that can give us anything, and we have a slip of paper with our careful wish, and we put the slip of paper in the coin slot.
That being said, if we had technology for achieving both intent alignment and robustness, I expect we’d be in a pretty good position! I think the main reason not to go after both is that we may possibly be able to get away with just one of the two paths.
If they are equivalent, then I feel like the obvious value of the work would make resource constraints go away?
However, thinking about raising money for it helps to convince me that the proposed linkage has “leaks”.
Imagine timelines where they had Robustness & Intent Alignment (but there was no point where they had “Inner Robustness” or “On-Distribution Alignment”). Some of those timelines might have win conditions, and others might not. The imaginable failures work for me as useful intuition pumps.
I haven’t managed to figure out a clean short response here, so I’ll give you apologies and lots of words <3
...
If I were being laconic, I might restate what I think I noticed: BOTH “Inner Alignment” and “Objective Robustness” have in some deep sense solved the principal-agent problem...
...but only Inner Alignment has solved the “timeless multi-agent case”, while Objective Robustness has solved the principal-agent problem for maybe only the user, and only at the moment the user requests help?
(I can imagine the people who are invested in the yellow or the red options rejecting this for various reasons, but I think it would be interesting to hear the objections framed in terms of principals, agents, groups, contracts, explicable requests for an AGI, and how these could interact over time to foreclose the possibility of very high value win conditions. Since my laconic response’s best expected follow-up is more debate, it seems good to sharpen and clarify the point.)
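Here is a purely hypothetical sketch of the quantifier difference I’m gesturing at (all of these names are mine, not the diagram’s): Robustness solves the problem for one principal at one moment, while Inner/Intent Alignment has to solve it for all principals across all times.

```python
# Hypothetical formalization of the laconic claim. Both properties "solve"
# a principal-agent problem, but with different quantifiers over principals
# and times.
from typing import Callable

Principal = str
Time = int

# serves(p, t) == True means the system acts as a faithful agent of p at time t.
Serves = Callable[[Principal, Time], bool]

def objective_robustness_solves(serves: Serves, requester: Principal, t_request: Time) -> bool:
    # Roughly: one principal (the user), at one moment (the request).
    return serves(requester, t_request)

def inner_alignment_solves(serves: Serves, principals: list[Principal], horizon: range) -> bool:
    # Roughly: all relevant principals, across all times ("timeless multi-agent").
    return all(serves(p, t) for p in principals for t in horizon)
```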
...
Restating “the same insight” in a concretely behavioral form: I think that, hypothetically, I would have a much easier time explicitly and honestly pitching generic (non-altruistic, non-rationalist) investors on an “AGI Startup” if I were aiming for Robustness rather than Intent Alignment.
The reason it would be easier: it enables the benefits to go disproportionately to the investors. Like, what if it turns out that disproportionate investor returns are not consistent with something like “the world’s Coherent Extrapolated Volition” (or whatever the latest synecdoche for the win condition is)? THEN, just request “pay out the investors and THEN with any leftovers do the good stuff”. Easy peasy <3
That is, Robustness is easier to raise funds for, because it increases the pool of possible investors from “saints” to “normal selfish investors”…
...which feels like almost an accusation against some people, which is not exactly what I’m aiming for. (I’m not not aiming for that, but it’s not the goal.) I’ll try again.
...
Restating “again, and at length, and with a suggested modification to the diagram”:
My intuitions suggest reversing the “coin vs paper” metaphor to make it very vivid and to make metal money be the real good stuff <3
(If you have not been studying blockchain economics by applying security mindset to protocol governance for a while, and kept generating things “not as good as gold” over and over and over, maybe this metaphor won’t make sense to you. It works for ME though?)
I imagine an “Intent Alignment” that is Actually Good as being like 100 kg of non-radioactive gold.
You could bury it somewhere, and dig it up 1000 years later, and it would still be just what it is: an accurate theory of pragmatically realizable abstract goodness that is in perfect resonance with the best parts of millennia of human economic and spiritual and scientific history up to the moment it was produced.
(
Asteroid mining could change the game for actual gold? And maybe genetic engineering could change the question of values, so that 150 years from now humans will be twisted demons?
But assuming no “historically unprecedented changes to our axiological circumstances”, Intent Alignment and gold seem metaphorically similar to me (and similarly at risk, as standards able to function for human people as an age-old meter stick for goodness in the face of technological plasticity and post-scarcity economics and wireheading and sybil attacks and so on).
)
Following this metaphorical flip: “Robustness” becomes the vending machine that will take any paper instruction, and any banking credentials you wish to provide (for a bank that is part of Westphalian finance and says that you have credit).
If you pay enough to whoever owns a Robust machine, it’ll give you almost anything…
...then the impedance mismatch could be thought of as a problem where the machine doesn’t model the gold plates covered in the Thoughtful Wish as “valuable” (because the gold isn’t held by a bank), though maybe the plates could work as an awkward and bulky set of instructions that just aren’t on paper... but then you could do a clever referential rewrite?
Thus, a simple way to reconcile these things would be for some rich/powerful person to come up, swipe a card to transfer 20 million Argentinian nuevo pesos (possibly printed yesterday?), and write the instructions: “Do what that 100 kg of gold that is stamped and shaped with relevant algorithms and inspiring poetry says to do.”
Since Robustness will, more or less safely-to-the-user, do anything that can be done (like it won’t fail to parse “that” in a sloppy and abusive way, for example triggering on other gold and getting the instruction scrambled, or any of an infinity of other quibbles that could be abusively generated), it will work, right?
By hypothesis, it has “Objective Robustness” so it WILL robustly achieve any achievable goal (or fail out in some informative way if asked to make 1+2=4 or whatever).
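(If it helps, here is the contract I’m imagining, as a tiny hypothetical interface, definitely not anyone’s actual spec: achieve any achievable goal, or fail informatively.)

```python
# A hypothetical sketch of the "Objective Robustness" contract as I'm using
# it here: either robustly achieve the goal, or fail out informatively
# instead of silently mangling the request.

class Unachievable(Exception):
    """Raised for impossible requests, e.g. "make 1+2=4"."""

def robust_execute(goal: str) -> str:
    if goal == "make 1+2=4":
        raise Unachievable("arithmetic is not up for negotiation")
    # By hypothesis, every achievable goal is achieved, on the user's own
    # parse of the instruction (no abusive readings of "that").
    return f"achieved: {goal}"
```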
So then TIME seems to be related to how the pesos and the paper instructions to follow the gold instructions could fail?
Suppose a Robust vending machine was first asked to create a singleton situation where an AGI exists, manages basically everyone, but isn’t following any kind of Intent Aligned “Golden Plan” that is philosophically coherent n’stuff.
Since the gears spin very fast, the machine understands that a Golden Plan would be globally inconsistent with its own de facto control of “all the things” (control it already holds, in its relatively pedestrian and selfish way, in service of the goals of the first person to print enough pesos), and so it would prevent any such Golden Plan from being carried out later.
To connect this longer version of a restatement with earlier/shorter restatements, recall the idea of solving the principal/agent problem in the “timeless multi-agent case”...
In the golden timelessly aligned case, if somehow in the future an actually better theory of “what an AGI should do” is discovered (and so we get “even more gold” in the coin/paper/vending machine metaphor), then Intent Alignment would presumably get out of the way and allow this progress to unfold in an essentially fair and wise way.
Robustness has no such guarantees. This may get at the heart of the inconsistency?
Compressing this down to a concrete suggestion to usefully change the diagram:
I think maybe you could add a 10th node, that was something like “A Mechanism To Ensure That Early-Arriving Robustness Defers To Late-Arriving Intent Alignment”?
(In some sense then, the thing that Robustness might lack is “corrigibility to high-quality late-arriving Alignment updates”?)
I’m pretty sure that Deference is not part of Robustness as normally conceived, but if such a thing somehow existed in addition to Robustness then I’d feel like: yeah, this is going to work, and the three things (Deference, Robustness, and Intent Alignment) might be logically sufficient to guarantee the win condition at the top of the diagram :-)
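As a final toy sketch (again, my own hypothetical formalization, nothing from the diagram): the arrival-order argument above, with Deference as the bit that decides whether a late-arriving Golden Plan gets foreclosed.

```python
# A toy timeline model (hypothetical names, mine alone). A Robust system
# arrives early; a good Intent Alignment theory arrives later. "Deference"
# is the property that the early system adopts the late-arriving theory
# instead of foreclosing it.

def timeline_wins(robust_arrives: int, alignment_arrives: int, deference: bool) -> bool:
    if robust_arrives < alignment_arrives:
        # The robust machine already controls "all the things" when the
        # Golden Plan shows up; without Deference, it prevents the switch.
        return deference
    return True  # alignment arrived first (or together), so it governs from the start

# The failure mode described above, and the fix the 10th node would provide:
assert not timeline_wins(robust_arrives=0, alignment_arrives=5, deference=False)
assert timeline_wins(robust_arrives=0, alignment_arrives=5, deference=True)
```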