I don’t want to give an overall opinion [ETA: because I haven’t looked into Palisade specifically at all, and am not necessarily wanting to invest the requisite effort into having a long-term worked-out opinion on the whole topic], but some considerations:
You can just say what was actually shown? And build a true case for scariness from the things that were actually shown?
You can’t show “XYZ scary thing is what happens when you have a true smarter-than-human mind” right now, because those don’t exist. But you can show “trained systems are able to break human-designed security systems in surprising ways” or “trained systems can hide thoughts” or “trained systems can send hidden messages” or “big searches find powerful solutions that humans didn’t think of and that have larger effects in the domain than what human designers achieved” etc.
Sometimes the reason your “experiment” is good at communicating is that you are correctly breaking someone’s innocent hopeful expectation that “AIs are tools designed by humans so they’ll just do what’s useful”. But if you oversell it as “AIs will convergently instrumentally do XYZ scary thing”, that overselling is unnecessary, because it isn’t actually what made the experiment communicate well, and it carries much of the badness of lying, because it isn’t really true.
It’s just deeply bad to deceive, ethically / ruleswise, and morally / consequences-wise, and virtuously / self-construction-wise.
There are a bunch of bad consequences to deceiving. This really shouldn’t have to be belabored all the time; we should just know this by now. But for example:
Later someone might patch the fake signals of misalignment. Then they can say “look we solved alignment”. And you’ll get accused of goalpost moving.
There will be more, and stronger, counterarguments / sociopolitical pressure if serious regulation is on the table. Then weaker arguments are more likely to be exposed.
You communicate incorrectly about what it is that’s dangerous ---> the regulation addresses the wrong thing and has no way of noticing that it should be addressing the right thing.
If exposed, you’ve muddied the waters in general / turned the market for strategic outlooks/understanding on AI into a market for lemons.
(I’m being lazy and not checking that I actually have an example of the following, and often that means the thing isn’t real, but: [<--RFW]) Maybe the thing is to use examples for crossing referential but not evidential/inferential distances.
That is, sometimes I’m trying to argue something to someone, and they don’t get some conceptual element X of my argument. And then we discuss X, and they come up with an example of X that I think isn’t really an example of X. But to them, their example is good, and it clicks for them what I mean, and then they agree with my argument or at least get the argument and can debate about it more directly. I might point out that I disagree with the example, but not necessarily, and move on to the actual discussion. I’ll keep in mind the difference, because it might show up again—e.g. it might turn out we still weren’t talking about the same thing, or had some other underlying disagreement.
I’m unsure whether this is bad / deceptive, but I don’t think so, because it looks/feels like “just what happens when you’re trying to communicate and there’s a bunch of referential distance”.
For example, talking about instrumental convergence and consequentialist reasoning is confusing / abstract / ungrounded to many people, even with hypothetical concrete examples; actual concrete examples are helpful. Think of people begging you to give any example of “AI takeover” no matter how many times you say “Well, IDK specifically how they’d do it / what methods would work”. You don’t have to claim that superviruses are somehow a representative / mainline takeover method in order to use them as an example of the general kind of thing you’re talking about. Though it does have issues...
(This is related to the first bullet point in this comment.)
So I don’t think it’s necessarily bad/deceptive to communicate by looking for examples that make things click in other people. But I do think it’s very fraught, for the above reasons. Looking for clicks is good if it’s crossing referential distance, but not if it’s standing in for crossing evidential/inferential distance.
I am interested in you actually looking at the paper in question and saying which of these apply in this context.
(That’s a reasonable interest, but not wanting to take that time is part of my not wanting to give an overall opinion about the specific situation with Palisade; I don’t have an opinion about Palisade, and my comment just means to discuss general principles.)
I think if you included literally this sentence at the top of your comment above, that would be a little bit clearer. Reading your comment, I was trying to figure out if you were claiming that the chess hacking research in particular runs afoul of these principles or does a good job meeting them, or something else.
(You do say “I don’t want to give an overall opinion, but some considerations:”, but nevertheless.)
Added clarification. (This seems to be a quite general problem of mismatched discourse expectations, where a commenter and a reader presume different shared assumptions about how context-specific the comment is, or other things like that.)
Reasonable.
But fwiw, I think this is actually a particularly good time to talk about principles-as-they-interface-with-reality.
I think Palisade is a pretty good example of an org that AFAICT cares as much about good group epistemics as it’s possible to care while still also being an advocacy org trying to get something done For Real. (And/or, I think it makes sense for LW-people to hold Palisade to the highest standards they think are reasonable while still expecting it to get something done.)
So, I think this is a particularly good time to talk through “okay what actually are those standards, in real life practice?”.