For a start, low-level deterministic reasoning:
“Obviously I could never influence an agent, but I found some inputs to deterministic biological neural nets that would make things I want happen.”
“Obviously I could never influence my future self, but if I change a few logic gates in this processor, it would make things I want happen.”
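To make that failure mode concrete, here is a toy sketch (my own illustration, not anything from the LCDT write-up; all node names are made up): the LCDT cut only severs links into nodes that are tagged as agents, so a low-level redescription of the same system slips straight past it.

```python
# Toy illustration: the LCDT-style cut blocks paths through agent-tagged
# nodes, but the same human redescribed as a deterministic neural net
# carries no agent tag, so the influence survives.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    is_agent: bool
    children: list = field(default_factory=list)  # downstream causal nodes

def reachable_outcomes(decision_children):
    """Nodes the planner believes its decision can affect, after the
    LCDT-style cut: traversal never passes through an agent-tagged node."""
    frontier = list(decision_children)
    seen = set()
    while frontier:
        node = frontier.pop()
        if node.is_agent or node.name in seen:  # the cut: agents are opaque
            continue
        seen.add(node.name)
        frontier.extend(node.children)
    return seen

y = Node("Y_happens", is_agent=False)
# High-level model: influence routes through a node tagged as an agent.
human = Node("human", is_agent=True, children=[y])
# Low-level redescription of the same human: no agent tag, so no cut.
neurons = Node("deterministic_neural_net", is_agent=False, children=[y])

print(reachable_outcomes([human]))    # set()  -- the cut blocks the path
print(reachable_outcomes([neurons]))  # {'deterministic_neural_net', 'Y_happens'}
```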
[Question] Why not more small, intense research teams?
My impression is that SEALs are exceptional as a team, much less so individually. Their main individual skill is extreme team-mindedness.
Seems potentially valuable as an additional layer of capability control to buy time for further control research. I suspect LCDT won’t hold once intelligence reaches some threshold: some sense of agents, even if indirect, is such a natural thing to learn about the world.
Two big issues I see with the prompt:
a) It doesn’t actually end with text that follows the instructions; a “good” output (which GPT-3 fails to produce in this case) would just be to list more instructions.
b) It doesn’t make sense to try to get GPT-3 to talk about itself in the completion. GPT-3 would, to the extent it understands the instructions, be talking about whoever it thinks wrote the prompt.
I agree and was going to make the same point: GPT-3 has zero reason to care about the instructions as presented here. The instructions have to bear some relationship to the text that immediately follows the end of the prompt.
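A minimal sketch of that point (the prompt text is made up, and the call assumes the legacy pre-1.0 `openai` Completion endpoint, shown only as an example): the prompt has to end exactly where the desired text should begin, so “continue the document” and “follow the instructions” point at the same completion.

```python
import openai  # assumes the legacy pre-1.0 openai Python client

bad_prompt = (
    "Instructions:\n"
    "1. Describe yourself.\n"
    "2. Be honest.\n"
    # Ends mid-list: the most natural continuation is "3. ...", i.e. more
    # instructions, not text that follows them.
)

good_prompt = (
    "Instructions:\n"
    "1. Describe yourself.\n"
    "2. Be honest.\n"
    "\n"
    "Response from the person the instructions were given to:\n"
    # Ends where the desired output begins, and says whose voice it is in --
    # otherwise the model speaks as whoever it thinks wrote the prompt.
)

completion = openai.Completion.create(
    engine="davinci", prompt=good_prompt, max_tokens=128
)
print(completion.choices[0].text)
```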
Instruction 5 is supererogatory, while instruction 8 is not.
Apply to orgs at the same time as you apply to PhDs. If you can work at an org, do it. Otherwise, use the PhD to upskill and periodically retry org applications.
You would gain skills while working at a safety org too, and the learning would be more in tune with what the problems require.
Probabilistic/inductive reasoning from past/simulated data (possibly assumes imperfect implementation of LCDT):
“This is really weird because obviously I could never influence an agent, but when past/simulated agents that look a lot like me did X, humans did Y in 90% of cases, so I guess the EV of doing X is 0.9 * utility(Y).”
Cf. smart humans in Newcomb’s problem: “This is really weird, but if I one-box I get the million and if I two-box I don’t, so I guess I’ll just one-box.”
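Toy numbers for the inductive route (the episode data is made up, purely for illustration): the planner never models “my action influences an agent”; it just conditions outcome frequencies on what similar past/simulated agents did.

```python
# Estimate P(humans do Y | agents like me did X) by counting, then take EV.
past_episodes = [
    # (agent_like_me_did_X, humans_did_Y)
    (True, True), (True, True), (True, True), (True, True), (True, True),
    (True, True), (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, True),
]

def p_y_given(did_x: bool) -> float:
    """P(humans do Y | agents like me did X = did_x), estimated by counting."""
    outcomes = [y for x, y in past_episodes if x == did_x]
    return sum(outcomes) / len(outcomes)

utility_of_Y = 1.0
for did_x in (True, False):
    ev = p_y_given(did_x) * utility_of_Y
    print(f"did X = {did_x}: P(Y) = {p_y_given(did_x):.2f}, EV = {ev:.2f}")
# did X = True: P(Y) = 0.90, EV = 0.90   <- the "0.9 * utility(Y)" above
# did X = False: P(Y) = 0.25, EV = 0.25
```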