The “morality is scary” problem of corrigible AI is an interesting one. It seems tricky, at least to a first approximation, in that I basically have no estimate of how much effort it would take to solve.
Your rot13 suggestion has the obvious corruption problem, but it also has a public-relations problem: I doubt the plan would be popular. However, I like where your head is at.
My own thinking on the subject is closely related to my “Outcome Influencing System (OIS)” concept. Most complete and concise summary here. I should write an explainer post, but haven’t gotten to it yet.
Basically, whatever system we use for deciding on and controlling the corrigible AI becomes the system whose alignment we are concerned with ensuring. It doesn’t really solve the problem; it just pushes it back one matryoshka doll around the AI.
> Your rot13 suggestion has the obvious corruption problem, but it also has a public-relations problem: I doubt the plan would be popular. However, I like where your head is at.
My suggestion is not supposed to be the final idea. It’s just supposed to be an improvement over what appears to be Wei Dai’s implicit idea: having philosophers with some connection to AGI labs solve these philosophical issues, then hardcoding the solutions so they can’t be changed.
(Perhaps you could argue that Wei Dai’s implicit idea is better, because there’s only a chance that these philosophers will be listened to, and even then only in the distant future. Maybe those conditions keep philosophers honest. But we could replicate those conditions in my scenario as well: randomly generate 20 different groups of philosophers, later randomly choose one group whose conclusions to act on, and only act on those conclusions after a 30-year delay.)
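For concreteness, here is a toy sketch of that “commit first, select later” setup. It is purely illustrative: the group count and delay come from the parenthetical above, while the function names and the seeded random source are just assumptions about how one might model it, not a real proposal.

```python
import random

# Numbers taken from the example above; everything else is a stand-in.
NUM_GROUPS = 20    # independent groups of philosophers
DELAY_YEARS = 30   # minimum wait before anyone's conclusions are acted on

def convene_groups(philosopher_pool, group_size, rng):
    """Randomly partition a pool of philosophers into NUM_GROUPS independent groups."""
    pool = list(philosopher_pool)
    rng.shuffle(pool)
    return [pool[i * group_size:(i + 1) * group_size] for i in range(NUM_GROUPS)]

def conclusions_to_act_on(group_conclusions, years_elapsed, rng):
    """Each group deliberates without knowing whether it will be selected.
    Only one group's conclusions are ever acted on, and only after the delay."""
    if years_elapsed < DELAY_YEARS:
        return None  # nothing is acted on yet
    return rng.choice(group_conclusions)

# Example usage with placeholder data:
rng = random.Random(0)
groups = convene_groups(range(200), group_size=10, rng=rng)
chosen = conclusions_to_act_on([f"conclusions of group {i}" for i in range(NUM_GROUPS)], 31, rng)
```

The point of the sketch is just that no group knows in advance whether its conclusions will matter, which is the property doing the honesty-keeping work.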
> Basically, whatever system we use for deciding on and controlling the corrigible AI becomes the system whose alignment we are concerned with ensuring. It doesn’t really solve the problem; it just pushes it back one matryoshka doll around the AI.
I’m not convinced they are the same problem, but I suppose it can’t hurt to check if ideas for the alignment problem might also work for the “morality is scary” problem.
I definitely like the directions you are exploring, and I agree they are improvements over the implicit AGI-lab-directed concept. That’s a useful thing to keep in mind, but so is what keeps them from being final ideas.
> I’m not convinced they are the same problem
When viewed as OISs from a high level, they are the same problem: a misaligned OIS in both cases. But you are correct that many of the details change. The properties of one OIS are quite different from the properties of the other, and that does matter for analyzing and aligning them. I think having a model that applies to both and makes the similarities and differences explicit would be useful (my suggestion is my OIS model, but it’s entirely possible there are better ones).
It seems like considerations about how to “keep philosophers honest” are implicitly about ensuring the alignment of a hypothetical socio-technical OIS. What do you think? Does that make sense at all, or does it seem more like a time-wasting distraction? I have to admit I’m uncomfortable with how much I have gotten stuck on the idea that championing this concept is a useful thing for me to be doing.
I do think the alignment problem and the “morality is scary” problem have a lot in common. In my thinking about the alignment problem and the way it leaks into other problems, the model that emerged for me was the OIS, which seems to generalize the part of the alignment problem I am interested in to social institutions whose goals are moral in nature, and to how those institutions relate to the values of individual people.
> I definitely like the directions you are exploring, and I agree they are improvements over the implicit AGI-lab-directed concept. That’s a useful thing to keep in mind, but so is what keeps them from being final ideas.
+1
> What do you think? Does that make sense at all, or does it seem more like a time-wasting distraction? I have to admit I’m uncomfortable with how much I have gotten stuck on the idea that championing this concept is a useful thing for me to be doing.
Glad you’re self-aware about this. I would focus less on championing the concept, and more on treating it as a hypothesis about a research approach which may or may not deliver benefits. I wouldn’t evangelize until you’ve got serious benefits to show, and show those benefits first (with the concept that delivered those benefits as more of a footnote).
I think the focus on “delivering benefits” is a good perspective. It feels complicated by my sense that a lot of the benefit of OIS is as an explanatory lens: when I want to discuss the things I’m focused on, I want to discuss them in terms of OIS, and it feels like not using OIS terminology makes the explanations more complicated. So in that regard, I guess I need to clearly define and demonstrate that explanatory benefit. But the “research approach” framing also seems like a good thing to keep in mind.
Thanks for your perspective 🙏