Proof strategy #1 starts with the idea that we live in a three-dimensional world containing objects and so on. We try to come up with an unambiguous definition of what those objects are, and from there we can have an unambiguous language for specifying what we want to happen in the world. We also somehow translate (or constrain) the AGI’s understanding of the world into that language, and now we can prove theorems about what the AGI is trying to do.
This is my tentative understanding of what John Wentworth is trying to do via his Natural Abstraction Hypothesis research program (most recent update here), and I’ve heard ideas in this vicinity from a couple other people as well.
I’m skeptical because a 3D world of localized objects seems to be an unpromising starting point for stating and proving useful theorems about the AGI’s motivations. After all, a lot of things that we humans care about, and that the AGI needs to care about, seem difficult to describe in terms of a 3D world of localized objects—consider the notion of “honesty”, or “solar cell efficiency”, or even “daytime”.
This is probably particularly characteristic of my approach. I’ve perhaps overstated the similarity of my approach to John Wentworth’s 😅 - I think that much of his research is useful to my approach, but there are also many points of disagreement. But I suppose everyone finds his research ultra-promising.
A couple of notes:
I think even if my approach doesn’t work out as the sole solution, it seems plausibly complementary to other approaches, including yours. For instance, if you don’t use the sort of ontological lock that I’m advocating, then you tend to end up struggling with some basic symbol–reality distinction: e.g., you’re likely to associate pictures of happy people with the concept of “happiness”, so a happiness maximizer might end up tiling the world with pictures of happy people. My approach avoids that for free (though the flip side is that it would likely not consider e.g. ems to be people unless explicitly programmed to, though that could probably be achieved).
I think concepts like “solar cell efficiency” might be quite feasible to define with my approach. If you have a clean 3D ontology, you can isolate an object like a solar panel in that ontology and then counterfactually ask how it would perform under various conditions. So you could ask: “How would this object perform if standard sunlight hit it under standard atmospheric conditions? How much power would it produce? Would it produce any problematic pollution?” You could be very precise about this.
… which is of course a curse as much as it is a blessing, e.g. you might not want a precise definition of “daytime”, and it might not be possible for people to write down a precise definition of “honesty”.
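The counterfactual-query idea above can be illustrated with a toy sketch. All names and numbers here are hypothetical (a real version would sit on top of an actual learned 3D world model); the point is just that once an object is isolated in the ontology, “how would it perform under standard conditions?” becomes an ordinary function call:

```python
from dataclasses import dataclass

@dataclass
class PanelObject:
    """A solar panel, isolated as one object in a toy 3D ontology."""
    area_m2: float     # surface area facing the sun
    efficiency: float  # fraction of incident power converted to electricity

# "Standard sunlight" test condition (roughly the standard irradiance
# used for rating panels).
STANDARD_IRRADIANCE_W_PER_M2 = 1000.0

def counterfactual_power(panel: PanelObject,
                         irradiance: float = STANDARD_IRRADIANCE_W_PER_M2) -> float:
    """How much power WOULD this object produce under the given conditions?"""
    return panel.area_m2 * irradiance * panel.efficiency

panel = PanelObject(area_m2=2.0, efficiency=0.20)
print(counterfactual_power(panel))  # 2.0 * 1000.0 * 0.20 = 400.0 watts
```

The precision cuts both ways, as noted: the query gives an exact answer, but only for conditions someone managed to write down exactly.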
This is probably particularly characteristic of my approach.
Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).
my approach … ontological lock …
I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.
Yeah, you were one of the “couple other people” I alluded to. The other was Tan Zhi-Xuan (if I was understanding her correctly during our most recent (very brief) conversation).
🤔 I wonder if I should talk with Tan Zhi-Xuan.
I think I know what you’re referring to, but I’m not 100% sure, and other people reading this probably won’t. Can you provide a link? Thanks.
I got the phrase “ontological lock” from adamShimi’s post here, but it only comes up very briefly there, so it’s not much help for understanding what I mean - and I’m sort of assuming that adamShimi meant the same thing I did. 😅 I’m not sure if it’s a term used elsewhere.
What I mean is forcing the AI to have a specific ontology, such as things embedded in 3D space, so you can directly programmatically interface with the AI’s ontology, rather than having to statistically train an interface (which would lead to problems with distribution shift and such).
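A minimal sketch of what “directly programmatically interface” could look like, with all names hypothetical: the AI’s world model is constrained to a fixed schema (objects at 3D positions), so queries against its beliefs are plain code rather than a statistically trained probe of opaque internal representations:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WorldObject:
    kind: str                              # e.g. "person", "solar_panel"
    position: Tuple[float, float, float]   # location in 3D space

class LockedWorldModel:
    """The AI is required to report its beliefs in this fixed schema."""

    def __init__(self, objects: List[WorldObject]):
        self.objects = objects

    def objects_of_kind(self, kind: str) -> List[WorldObject]:
        # A direct programmatic interface: because the ontology is fixed,
        # there is no learned decoder here, and hence no distribution-shift
        # worries about the interface itself.
        return [o for o in self.objects if o.kind == kind]

model = LockedWorldModel([
    WorldObject("person", (0.0, 0.0, 0.0)),
    WorldObject("solar_panel", (3.0, 1.0, 0.0)),
])
print(len(model.objects_of_kind("person")))  # 1
```

The hard part, of course, is the lock itself: getting a capable AI to actually use this schema as its ontology, rather than merely emitting reports in it.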