habryka comments on Common misconceptions about OpenAI

habryka 1 Sep 2022 20:57 UTC
LW: 9 AF: 3
Here is an example quote from the latest OpenAI blogpost on AI Alignment:

Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.

This sounds to me very straightforwardly like the plan of “we are going to train non-agentic AIs that will help us with AI Alignment research, and will limit their ability to influence the world by, e.g., not giving them access to the internet”. I don’t know whether “boxing” is exactly the right word here, but it’s the strategy I was pointing to.
- Rohin Shah 2 Sep 2022 10:29 UTC
  LW: 4 AF: 3
  The immediately preceding paragraph is:
  Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems are easier to align than general-purpose systems or systems much smarter than humans.
  I would have guessed the claim is “boxing the AI system during training will be helpful for ensuring that the resulting AI system is aligned”, rather than “after training, the AI system might be trying to pursue its own goals, but we’ll ensure it can’t accomplish them via boxing”. But I can see your interpretation as well.
  - habryka 3 Sep 2022 0:05 UTC
    LW: 7 AF: 2
    Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.
    
    I agree that “train a system with internet access, but then remove it, then hope that it’s safe”, doesn’t really make much sense. In general, I expect bad things to happen during training, and separately, a lot of the problems that I have with training things on the internet come from it being an environment that seems like it would incentivize a lot of agency and make supervision really hard, because you have a ton of permanent side effects.
    - Rohin Shah 3 Sep 2022 14:06 UTC
      LW: 4 AF: 3
      Oh, you’re making a claim directly about other people’s approaches, not about what other people think about their own approaches. Okay, that makes sense (though I disagree).
      Oh, I do think a bunch of my problems with WebGPT is that we are training the system on direct internet access.
      I agree that “train a system with internet access, but then remove it, then hope that it’s safe”, doesn’t really make much sense.
      I was suggesting that the plan was “train a system without Internet access, then add it at deployment time” (aka “box the AI system during training”). I wasn’t at any point talking about WebGPT.