Also not much contact, but my impression is you can roughly guess what their research results would be by looking at their overall views and thinking about what evidence you could find to support them. Which seems fair to characterize as advocacy work? (Motivated research?)
The diff from your description is that the info provided is not only conditional on “the info they’ll find useful” but also somewhat on “will likely move their beliefs toward conclusions Palisade hopes they’ll reach”.
you can roughly guess what their research results would be by looking at their overall views and thinking about what evidence you could find to support them
I don’t think this frame makes a lot of sense. There’s not some clean distinction between “motivated research” and … “unmotivated research”. It’s totally fair to ask whether the research is good, and whether the people doing the research are actually trying to figure out what is true or whether they have written their conclusion at the bottom of the page. But the fact that we have priors doesn’t mean that we have written our conclusion at the bottom of the page!
E.g. imagine a research group who thinks cigarettes likely cause cancer. They are motivated to show that cigarettes cause cancer, because they think this is true and important. And you could probably guess the results of their studies knowing this motivation. But if they’re good researchers, they’ll also report negative results. They’ll also be careful not to overstate claims. Because while it’s true that cigarettes cause cancer, it would be bad to publish things that have correct conclusions but bad methodologies! It would hurt the very thing that the researchers care about—accurate understanding of the harms that cigarettes cause!
My colleagues and I do not think current models are dangerous (for the risks we are most concerned about—loss-of-control risks). We’ve been pretty clear about this. But we think we can learn things about current models that will help us understand risks from future models. I think our chess work and our shutdown-resistance work demonstrate some useful existence proofs about reasoning models. They definitely updated my thinking about how RL training shapes AI motivations. And I was often not able to predict in advance what models would do! I did expect models to rewrite the board file given the opportunity and no other way to win, but I wasn’t able to predict exactly which models would do this and which would not. I think it’s quite interesting that the rates of this behavior differed so much across models! I also think it’s interesting that models found strategies I didn’t expect, like trying to replace their opponent and trying to use their own copy of Stockfish to get moves. It’s not surprising in retrospect, but I didn’t predict it.
Our general approach is to try to understand the models to the best of our ability and to accurately convey the results of our work. We publish all our transcripts and code for these experiments so others can check the work and run their own experiments. We think these existence proofs have important implications for the world, and that is why we speak about them so publicly.