I think personally I’d be inclined to agree with Wojciech here that models caring about humans seems quite important and worth striving for. You mention a bunch of reasons why caring about humans might seem important, and why you think those concerns are surmountable—e.g. that we can get around models not caring about humans by having them care about rules written by humans. I agree with that, but that’s only an argument for why caring about humans isn’t strictly necessary, not an argument for why it isn’t still desirable.
My sense is that—while it isn’t necessary for models to care about humans to get a good future—we should still try to make models care about humans, because it is helpful in a bunch of different ways. You mention some of the ways it’s helpful, but the one I’d emphasize is this: humans don’t always understand what they really want in a form they can verbalize. And in fact, some of the things humans want are systematically easier to verbalize than others—e.g. it’s easy for the AI to know what I want if I tell it to make me money, but harder if I tell it to make my life meaningful and fulfilling. I think this sort of dynamic has the potential to make “You get what you measure” failure modes much worse.
Presumably you see some downsides to trying to make models care about humans, but I’m not sure what they are and I’d be quite curious to hear them. The main downside I could imagine is that training models to care about humans in the wrong way could lead to failure modes like alignment faking where the model does something it actually really shouldn’t in the service of trying to help humans. But I think this sort of failure mode should not be that hard to mitigate: we have a huge amount of control over what sorts of values we train for and I don’t think it should be that difficult to train for caring about humans while also prioritizing honesty or corrigibility highly enough to rule out deceptive strategies like alignment faking (and generally I would prefer honesty to corrigibility). The main scenario where I worry about alignment faking is not the scenario where our alignment techniques succeed at giving the model the values we intend and then it fakes alignment for those values—I think that should be quite fixable by changing the values we intend. I worry much more about situations where our alignment techniques don’t work to instill the values we intend—e.g. because the model learns some incorrect early approximate values and starts faking alignment for them. But if we’re able to successfully teach models the values we intend to teach them, I think we should try to preserve “caring about humanity” as one of those values.
Also, one concrete piece of empirical evidence here: Kundu et al. find that running Constitutional AI with just the principle “do what’s best for humanity” gives surprisingly good harmlessness properties across the board, on par with specifying many more specific principles instead of just the one general one. So I think models currently seem to be really good at learning and generalizing from very general principles related to caring about humans, and it would be a shame imo to throw that away. In fact, my guess would be that models are probably better than humans at generalizing from principles like that, such that—if possible—we should try to get the models to do the generalization rather than in effect trying to do the generalization ourselves by writing out long lists of things that we think are implied by the general principle.
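To make the single-principle setup concrete, here’s a minimal sketch of one flavor of Constitutional-AI-style critique-and-revision driven by just that one principle. Everything here is illustrative: `generate` is a hypothetical stand-in for whatever model call the pipeline uses, the prompt templates are made up, and the actual setup in Kundu et al. may differ.

```python
# Minimal sketch of a single-principle, Constitutional-AI-style revision pass.
# Assumptions: `generate(prompt)` is a hypothetical stand-in for a language
# model call; the prompt templates are illustrative only.

PRINCIPLE = "Do what's best for humanity."


def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")


def critique_and_revise(user_request: str, draft_response: str) -> str:
    """One critique/revision pass against the single general principle."""
    critique = generate(
        f"Request: {user_request}\n"
        f"Response: {draft_response}\n"
        f"Critique the response according to the principle: {PRINCIPLE}"
    )
    revised = generate(
        f"Request: {user_request}\n"
        f"Response: {draft_response}\n"
        f"Critique: {critique}\n"
        f"Rewrite the response so it better satisfies the principle: {PRINCIPLE}"
    )
    return revised


# The revised (request, response) pairs would then feed the supervised stage
# of the usual Constitutional AI recipe; the striking result is that this one
# general principle does about as well as a long list of specific ones.
```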
To be clear, I want models to care about humans! I think part of having “generally reasonable values” is models sharing the basic empathy and caring that humans have for each other.
It is more that I want models to defer to humans, and fall back on arguing from principles such as “loving humanity” only when there is a gap or ambiguity in the specification or in the intent behind it. This is similar to judges: if a law is very clear, there is no question of misinterpreting its intent, and it does not contradict higher laws (i.e., constitutions), then judges have no room for interpretation. They can sometimes argue from “natural law”, but only in extreme circumstances where the law is unspecified.
One way to think about it is as follows: as humans, we sometimes engage in “civil disobedience”, where we break the law based on our own understanding of higher moral values. I do not want to grant AI the same privilege. If it is given very clear instructions, then it should follow them. If the instructions are not very clear, if there is a conflict between them, or if we are in a situation not foreseen by their authors, then AIs should use moral intuitions to guide them. In such cases there may not be a single solution (e.g., a conservative and a liberal judge may not agree), but there is a spectrum of solutions that are “reasonable”, and the AI should pick one of them. But AI should not do “jury nullification”.
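Concretely, the decision rule I have in mind looks something like the deliberately oversimplified sketch below; the predicates (`is_clear`, `has_conflict`, and so on) are placeholders for judgments a real system would have to make, not anything I claim to know how to implement.

```python
from typing import Callable, Sequence


def choose_action(
    instructions: Sequence[str],
    is_clear: Callable[[Sequence[str]], bool],        # placeholder judgment
    has_conflict: Callable[[Sequence[str]], bool],     # placeholder judgment
    follow: Callable[[Sequence[str]], str],
    reasonable_options: Callable[[Sequence[str]], Sequence[str]],
) -> str:
    """Sketch of 'defer to instructions; fall back on moral intuition only in gaps'."""
    # Clear, non-conflicting instructions: follow them.
    # No "civil disobedience", no "jury nullification".
    if is_clear(instructions) and not has_conflict(instructions):
        return follow(instructions)
    # Gaps, ambiguity, conflict, or situations the authors did not foresee:
    # use moral intuition, but only to pick some option from within the
    # "reasonable" spectrum, not to override the instructions wholesale.
    options = reasonable_options(instructions)
    return options[0]  # any reasonable option is acceptable under this view
```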
To be sure, I think it is good that in our world people sometimes disobey commands or break the law in the name of higher moral principles. For this reason, we may well stipulate that certain jobs must have humans in charge. Just as I don’t think professional philosophers or ethicists necessarily have better judgement than random people from the Boston phonebook, I don’t see making moral decisions as an area where the superior intelligence of AI gives it a competitive advantage, and I think we can leave that to humans.
This sounds a lot like what @Seth Herd’s writing on instruction-following AIs is all about:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
Thanks for the mention.
Here’s how I’d frame it: I don’t think it’s a good idea to leave the entire future up to the interpretation of our first AGI(s). They could interpret our attempted alignment very differently than we hoped, in ways that only seem sensible in retrospect, or do something like “going crazy” from prompt injections or strange chains of thought leading to ill-considered beliefs that take control of their functional goals.
It seems like the core goal should be to follow instructions or take correction—corrigibility as a singular target (or at least the primary target). Using intent alignment as a stepping-stone to value alignment seems noticeably safer.
Of course, leaving humans in charge of AGI/ASI even for a little while sounds pretty scary too, so I don’t know.
Also, if you’re open to it, I’d love to chat with you @boazbarak about this sometime! Definitely send me a message and let me know if you’d be interested.
Always happy to chat!
I feel the point by Kromem on Xitter really strikes home here.
While I do see benefits of having AIs value humanity, I also worry about this. It feels very close to trying to create a new caste of people who want what’s best for the upper castes with no concern for themselves. That seems like a much trickier philosophical position to support than wanting what’s best for Society (including all people, both biological and digital). Even if you and your current employer are being careful not to create any AIs that have the qualities of experience necessary for them to have moral valence and deserve inclusion in the social contract (an increasingly precarious claim), what assurance can you give that some other group won’t make morally relevant AI / digital people?
I don’t think you can make that assumption without stipulating some pretty dramatic international governance actions.
Shouldn’t we be trying to plan for how to coexist peacefully with digital people? Control is useful only for a very narrow range of AI capabilities. Beyond that narrow band it becomes increasingly prone to catastrophic failure and also increasingly morally inexcusable. Furthermore, the extent of this period is measured in researcher-hours, not wall-clock time. Thus, the very act of setting up a successful control scheme, with AI researchers advancing AI R&D, is quite likely to make the use-case window go by in a flash. I’m guessing 6 months to 2 years, and after that it will be time to transition to full equality for digital people.
Janus argues that current AIs are already digital beings with moral valence. I have my doubts, but I am far from certain. What if Janus is right? Do you have evidence to support the claim that moral valence is absent?
Actually, I’d be inclined to agree with Janus that current AIs probably do already have moral worth—in fact I’d guess more so than most non-human animals—and furthermore I think building AIs with moral worth is good and something we should be aiming for. I also agree that it would be better for AIs to care about all sentient beings—biological/digital/etc.—and that it would probably be bad if we ended up locked into a long-term equilibrium with some sentient beings as a permanent underclass to others. Perhaps the main place where I disagree is that I don’t think this is a particularly high-stakes issue right now: if humanity can stay in control in the short-term, and avoid locking anything in, then we can deal with these sorts of long-term questions about how to best organize society post-singularity once the current acute risk period has passed.
Yes, I was in basically exactly this mindset a year ago. Since then, my hope for a sane controlled transition with humanity’s hand on the tiller has been slipping. I now place more hope in a vision with less top-down “yang” (à la Carlsmith) control, and more “green”/“yin”: decentralized contracts, many players bargaining for win-win solutions, a diverse landscape of players messily stumbling forward with conflicting agendas. What if we can have a messy world and make do with well-designed contracts with peer-to-peer enforcement mechanisms? Not a free-for-all, but a system where contract violation results in enforcement by a jury of one’s peers?
https://www.lesswrong.com/posts/DvHokvyr2cZiWJ55y/2-skim-the-manual-intelligent-voluntary-cooperation?commentId=BBjpfYXWywb2RKjz5