ESRogs comments on Alignment As A Bottleneck To Usefulness Of GPT-3

ESRogs 21 Jul 2020 20:59 UTC
LW: 22 AF: 12
0
AF
I wonder how long we’ll be in the “prompt programming” regime. As Nick Cammarata put it:
We should actually be programming these by manipulating the hidden layers and prompts are a stand-in until we can.
My guess is that OpenAI will pretty quickly (within the next year) find a much better way to interface with what GPT-3 has learned.
Do others agree? Any reason to think that wouldn’t be possible (or wouldn’t give significant benefits)?
- Andy Jones 22 Jul 2020 8:22 UTC
  6 points
  0
  Parent
  I think that going forward there’ll be a spectrum of interfaces to natural language models. At one end you’ll have fine-tuning, and at the other you’ll have prompts. The advantage of fine-tuning is that you can actually apply an optimizer to the task! The advantage of prompts is anyone can use them.
  In the middle of the spectrum, two things I expect are domain-specific tunings and intermediary models. By ‘intermediary models’ I mean NLP models fine-tuned to take a human prompt over a specific area and return a more useful prompt for another model, or a set of activations or biases that prime the other model for further prompting.
  The ‘specific area’ could be as general as ‘less flights of fancy please’.
- johnswentworth 21 Jul 2020 21:50 UTC
  LW: 4 AF: 2
  0
  AF Parent
  The problem with directly manipulating the hidden layers is reusability. If we directly manipulate the hidden layers, then we have to redo that whenever a newer, shinier model comes out, since the hidden layers will presumably be different. On the other hand, a prompt is designed so that human writing which starts with that prompt will likely contain the thing we want—a property mostly independent of the internal structure of the model, so presumably the prompt can be reused.
  I think the eventual solution here (and a major technical problem of alignment) is to take an internal notion learned by one model (i.e. found via introspection tools), back out a universal representation of the real-world pattern it represents, then match that real-world pattern against the internals of a different model in order to find the “corresponding” internal notion. Assuming that the first model has learned a real pattern which is actually present in the environment, we should expect that “better” models will also have some structure corresponding to that pattern—otherwise they’d lose predictive power on at least the cases where that pattern applies. Ideally, this would all happen in such a way that the second model can be more accurate, and that increased accuracy would be used.
  In the shorter term, I agree OpenAI will probably come up with some tricks over the next year or so.
  - FactorialCode 22 Jul 2020 21:31 UTC
    5 points
    0
    Parent
    
    I think the eventual solution here (and a major technical problem of alignment) is to take an internal notion learned by one model (i.e. found via introspection tools), back out a universal representation of the real-world pattern it represents, then match that real-world pattern against the internals of a different model in order to find the “corresponding” internal notion.
    
    Can’t you just run the model in a generative mode associated with that internal notion, then feed that output as a set of observations into your new model and see what lights up in it’s mind? This should work as long as both models predict the same input modality. I could see this working pretty well for matching up concepts between the latent spaces of different VAEs. Doing this might be a bit less obvious in the case of autoregressive models, but certainly not impossible.
    - johnswentworth 22 Jul 2020 22:03 UTC
      3 points
      0
      Parent
      This works if both (a) both models are neural nets, and (b) the “concept” cleanly corresponds to one particular neuron. You could maybe loosen (b) a bit, but the bottom line is that the nets have to represent the concept in a particular way—they can’t just e.g. run low-level physics simulations in order to make predictions. It would probably allow for some cool applications, but it wouldn’t be a viable long-term path for alignment with human values.
      - FactorialCode 22 Jul 2020 22:52 UTC
        1 point
        2
        Parent
        I think you can loosen (b) quite a bit if you task a separate model with “delineating” the concept in the new network. The procedure does effectively give you access to infinite data, so the boundary for the old concept in the new model can be as complicated as your compute budget allows. Up to and including identifying high level concepts in low level physics simulations.
        johnswentworth 22 Jul 2020 23:46 UTC
        3 points
        0
        Parent
        We currently have no criteria by which to judge the performance of such a separate model. What do we train it to do, exactly? We could make up some ad-hoc criterion, but that suffers from the usual problem of ad-hoc criteria: we won’t have a reliable way to know in advance whether it will or will not work on any particular problem or in any particular case.
        FactorialCode 23 Jul 2020 0:38 UTC
        3 points
        0
        Parent
        The way I was envisioning it is that if you had some easily identifiable concept in one model, e.g. a latent dimension/feature that corresponds to the log odd of something being in a picture, you would train the model to match the behaviour of that feature when given data from the original generative model. Theoretically any loss function will do as long as the optimum corresponds to the situation where your “classifier” behaves exactly like the original feature in the old model when both of them are looking at the same data.
        
        In practice though, we’re compute bound and nothing is perfect and so you need to answer other questions to determine the objective. Most of them will be related to why you need to be able to point at the original concept of interest in the first place. The acceptability of misclassifying any given input or world-state as being or not being an example of the category of interest is going to depend heavily on things like the cost of false positives/negatives and exactly which situations get misclassified by the model.
        
        The thing about it working or not working is a good point though, and how to know that we’ve successfully mapped a concept would require a degree of testing, and possibly human judgement. You could do this by looking for situations where the new and old concepts don’t line up, and seeing what inputs/world states those correspond to, possibly interpreted through the old model with more human understandable concepts.
        
        I will admit upon further reflection that the process I’m describing is hacky, but I’m relatively confident that the general idea would be a good approach to cross-model ontology identification.
- Zachary Robertson 22 Jul 2020 2:19 UTC
  2 points
  0
  Parent
  [Deleted]
  - johnswentworth 22 Jul 2020 15:40 UTC
    4 points
    0
    Parent
    Any sort of probabilistic model offers the usual interpretations of the probabilities as an interface. For instance, I can train an LDA topic model, look at the words in the learned topics, pick a topic I’m interested in, then look at that topic’s weighting in each document in order to find relevant documents. More generally, I can train any clustering model, pick a cluster I’m interested in, then look for more things in that cluster. Or if I train a causal model, I can often interpret the learned parameters as estimates of physical interactions in the world. In each case, I’m effectively using the interpretation of the model’s built-in probabilities as an interface.
    This is arguably the main advantage of probabilistic models over non-probabilistic models: they come with a fairly reliable, well-understood built-in interface.
    - Zachary Robertson 23 Jul 2020 1:23 UTC
      3 points
      0
      Parent
      [Deleted]
  - ESRogs 22 Jul 2020 6:18 UTC
    2 points
    0
    Parent
    Have we ever figured out a way to interface with what something has learned that doesn’t involve language prompts?
    You might be interested in some of Chris Olah’s work on interpretability. For example, this.
    EDIT: Or even just the example of sampling from the latent space of a variational autoencoder should count, I would think.
    - Zachary Robertson 22 Jul 2020 12:00 UTC
      3 points
      0
      Parent
      [Deleted]