Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 23 Jan 2025 13:24 UTC
3 points
0
As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument being that, if a model can do steg in this very toy setting where we’ve significantly nudged the model, it might do steg in more realistic scenarios.
There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”.
I didnt’ make these points especially clear in the slide deck—thanks for the feedback!
Agreed on the rest of points!
- julius vidal 23 Jan 2025 23:37 UTC
  1 point
  0
  Parent
  >As originally conceived, this is sort of like a “dangerous capability” eval for steg.
  
  I am actually just about to start building something very similar to this for the AISI’s evals bounty program.