Cole Wyeth comments on Thane Ruthenis’s Shortform

Cole Wyeth 12 Jul 2025 18:17 UTC
4 points
2
Yeah, I like this framing.
I don’t really know how to make it precise, but I suspect that real life has enough hacks and loopholes that it’s hard to come up with plans that knowably don’t have counterplans which a smarter adversary can find, even if you assume that adversary is only modestly smarter. That’s what makes me doubt that what I called adversarially robust augmentation and distillation actually works in practice. I don’t think I have the frames for thinking about this problem rigorously.
- Thane Ruthenis 21 Jul 2025 14:58 UTC
  2 points
  0
  Incidentally, your Intelligence as Privilege Escalation is pretty relevant to that picture. I had it in mind when writing that.
- quetzal_rainbow 13 Jul 2025 4:04 UTC
  2 points
  0
  The concept of a weird machine is the closest to being useful here, and an important question is “how do we check that our system doesn’t form any weird machine?”