A general principle: if we constrain two neural networks to communicate via natural language, we need some pressure towards ensuring they actually use language in the same sense as humans do, rather than (e.g.) steganographically encoding the information they really care about.
The most robust way to do this: pass the language via a human, who tries to actually understand the language, then does their best to rephrase it according to their own understanding.
What do you lose by doing this? Mainly: you can no longer send messages too complex for humans to understand. Also: in general you can’t backprop through discrete language anyway, but I’d guess there are some tricks for approximating that which don’t work as well when a human is in the loop.
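Concretely, the loop I'm picturing looks something like this. This is only a minimal sketch; the `sender` / `paraphrase` / `receiver` callables and the `Message` type are placeholders I've made up, not any real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Message:
    text: str

def send_via_human(
    sender: Callable[[str], Message],          # model A: task -> natural-language message
    paraphrase: Callable[[Message], Message],  # human: reads, understands, restates in own words
    receiver: Callable[[Message], str],        # model B: message -> action/answer
    task: str,
) -> str:
    raw = sender(task)
    # The human bottleneck: only content the human actually understood (and chose
    # to restate) gets through. Any bits riding on exact wording, punctuation,
    # token choice, etc. should be destroyed here.
    rephrased = paraphrase(raw)
    return receiver(rephrased)

# Trivial stand-ins, just to show the wiring:
if __name__ == "__main__":
    sender = lambda task: Message(f"Please look up the capital of {task}.")
    paraphrase = lambda m: Message("Could you find that country's capital city?")
    receiver = lambda m: f"(acting on: {m.text!r})"
    print(send_via_human(sender, paraphrase, receiver, "France"))
```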
That doesn’t actually solve the problem. The system could just encode the desired information in the semantics of some unrelated sentences—e.g. talk about pasta to indicate X = 0, or talk about rain to indicate X = 1.
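To make the worry concrete, here's a toy version of that channel (the sentences and the paraphrases are made up): the bit is carried by what the sentence is about, so any meaning-preserving rephrasing carries it through.

```python
# Toy covert channel: the bit rides on the *topic*, not the wording,
# so a faithful paraphrase passes it along untouched.

ENCODE = {0: "I had a great bowl of pasta for lunch today.",
          1: "It looks like it's going to rain all afternoon."}

def decode(text: str) -> int:
    # The receiver only needs to recover the topic, not the exact words.
    return 1 if "rain" in text.lower() else 0

def human_paraphrase(text: str) -> str:
    # A faithful human rephrasing keeps the meaning, and therefore the topic.
    if "pasta" in text:
        return "They mentioned enjoying some pasta at lunchtime."
    return "They said heavy rain is expected this afternoon."

for bit in (0, 1):
    assert decode(human_paraphrase(ENCODE[bit])) == bit  # channel survives the paraphrase
print("covert bit survives a meaning-preserving paraphrase")
```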
I expected you to bring up the Natural Abstraction Hypothesis here. Wouldn’t the communication between the parties naturally use the same concepts?
Same concepts, yes, but that doesn't necessarily mean they're encoded in language the way humans typically encode them.
Another possible way to provide pressure towards using language in a human-like way is some form of multi-task/multi-agent setup, inspired by the paper Multitasking Inhibits Semantic Drift. The authors show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then mix and match the instructors and executors during training, it becomes difficult to drift from the original semantics: all of the instructors and executors would need to drift in the same direction, so any local change in semantics is sub-optimal compared to using language with the correct, shared semantics. The examples in the paper are on quite toy problems, but I think in principle this could work.
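Roughly the kind of training loop I have in mind, as a sketch only; the classes and callables below are placeholders I'm inventing, not the paper's actual code:

```python
import random

class Instructor:
    """Placeholder for an instruction-writing policy pretrained on human-labelled data."""
    def write_instruction(self, task):
        ...

class Executor:
    """Placeholder for an instruction-following policy pretrained on human-labelled data."""
    def follow(self, instruction):
        ...

def multitask_finetune(instructors, executors, tasks, reward_fn, update_fn, steps=10_000):
    """Randomly re-pair instructors and executors at every step.

    Because each instructor must be understood by *every* executor (and vice
    versa), a privately drifted "dialect" only pays off if all agents drift in
    the same direction, which makes sticking to the original human-grounded
    semantics the locally optimal choice.
    """
    for _ in range(steps):
        task = random.choice(tasks)
        instructor = random.choice(instructors)   # fresh pairing each step
        executor = random.choice(executors)
        instruction = instructor.write_instruction(task)
        outcome = executor.follow(instruction)
        reward = reward_fn(task, outcome)
        update_fn(instructor, executor, reward)   # e.g. a policy-gradient step
```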
Not being able to send messages too complex for humans to understand seems to me like it’s plausibly a benefit for many of the cases where you’d want to do this.