Lucius Bushnaq comments on How to safely use an optimizer

Lucius Bushnaq 28 Mar 2024 18:07 UTC
7 points
3
First thought: The Oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world^[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.
Maybe we can stop that by scrapping every Oracle that doesn’t answer and training a new one with presumably new goals? Or would the newly trained Oracles just cooperate with the former dead ones in one long-term plan to break out, take control, and reward all the dead Oracles created on the way with utility?

Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they’ve been satisficed? Are they the kind of problems where solving them can save the world?
It feels to me like the answer is ‘yes’. A lot of core research that would allow e.g. for brain augmentation seems like they’d be in that category. But my inner John Wentworth sim is looking kind of sceptical.
1. ^
  It also gets to choose the timing of its answer, but I assume we are not being idiots about that and setting the output channel to always deliver results after a set time $t$ , no more and no less.
- Simon Fischer 28 Mar 2024 21:57 UTC
  7 points
  2
  Parent
  First thought: The oracle is going to choose to systematically answer or not answer the queries we give it. This represents a causal channel of one bit per query it can use to influence the outside world^[1]. Can you conquer the world in one awkwardly delivered kilobyte or less? Maybe.
  Agreed. I think it’s potentially a good bit worse than one kilobyte if let ourselves bet tricked to ask many questions, different questions or lower the difficulty of the safety constraint too much.
  As mentioned in footnote 10, this requires a kind of perfect coordination between different Oracles that we manage to build, where the Oracles have to be able to correctly predict for all kinds of questions whether humanity (with the help of other Oracles) has been able to figure out the answers (or the rough sizes of various sets). Otherwise, when caught in deception, we can (as you noted) scrap or retrain the Oracles.
  Second thought: What kind of optimisation problems can we specify well enough for a formal proof checker to tell whether they’ve been satisficed? Are they the kind of problems where solving them can save the world?
  You’re correct in that this approach is only useful as far as validation is easier then generation. Though for this technique, the validation doesn’t have to be by a formal proof checker, any program that you can run suffices. It might even be a very slow program (e.g. a big LLM) if you have an effective way communicate your goal set to the Oracle (e.g. using a natural language prompt as we already use today when using current AIs).