I don’t think this works in the infinite limit. With a truly unlimited amount of compute, insane things happen. I wouldn’t trust that a randomly initialized network wasn’t already a threat.
For example, bulk randomness can produce deterministic-seeming laws over the distribution (statistical mechanics), and those laws can in turn support the formation and evolution of life.
Alternatively, a sufficiently large neural net could just have all sorts of things hiding in it by sheer probability.
The win scenario here is that these techniques work well enough that we get LLMs that can just tell us how to solve alignment properly.
We don’t need it to work in the infinite limit. (Personally, I’m assuming we’ll only be using this to align approximately-human-level research assistants to help us do AI-Assisted Alignment research — so at a level where if we failed, it might not be automatically disastrous.)