Not surprising, but good that someone checked to see where we are at.
At base, GPT-4 is a weak oracle with extremely weak level 1 self-improvement[1]. I would be massively surprised if such a system did anything that even hints at it being dangerous.
The question I now have is how much it enables people to do bad things. Take a capable human with bad intentions and combine them with GPT-4: how much "better" would such a human be at realizing those bad intentions?
Edit: badly worded first take
Level 1 amounts to memory.
Level 2 amounts to improvement of the model, basically adjusting its parameters.
Level 3 is a change to the model itself, so bigger, different architecture, etc.
Level 4 is a change to the underlying computational substrate.
Levels 1+2 would likely be enough to get into dangerous territory (obviously depending on the size of the model, the memory attached, and how much power can be squeezed out of the model).
Probably a feature of the current architecture, not a bug. Since we still rely on Transformers that suffer from mode collapse when they're fine-tuned, we will probably never see much more than weak level 2 self-improvement. Feeding its own output back into itself/a new instance basically turns it into a Turing machine, so we have now built something that COULD be described as level 4. But then again, we see mode collapse, so the model basically stalls. Feeding its own output into a non-fine-tuned version probably produces the same result, since mode collapse is an emergent property of the input having less and less entropy. There might be a real risk here in jailbreaking the model to apply randomness to its output, but if this property is applied globally, then the risk of AGI emerging is akin to the Infinite Monkey Theorem.
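To make the feedback loop concrete, here is a minimal sketch (Python; `generate` is a hypothetical stand-in for whatever inference call you use, not a real library API) of "feeding its own output into itself", with a temperature parameter as the place where extra randomness would be injected:

```python
# Minimal sketch of the self-feeding loop described above. `generate` is a
# hypothetical placeholder for a text-generation call; it is not a real API.

def generate(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder: call a language model and return its completion."""
    raise NotImplementedError

def self_feed(seed_prompt: str, steps: int = 10, temperature: float = 0.0) -> list[str]:
    """Feed the model's own output back in as its next prompt."""
    outputs = []
    text = seed_prompt
    for _ in range(steps):
        # Each iteration's prompt is just the previous iteration's output,
        # so any entropy the model strips out of the text never comes back.
        text = generate(text, temperature=temperature)
        outputs.append(text)
    return outputs
```

At low temperature each pass drains entropy from the text and the loop stalls (the mode collapse described above); turning the temperature up is the "apply randomness to its output" scenario.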