evhub comments on evhub’s Shortform

evhub 6 Jul 2022 20:46 UTC
LW: 10 AF: 8
- Ensembling as an AI safety solution is a bad way to spend down our alignment tax: training a second model doubles the compute budget, but even in the best-case scenario where the second model is a totally independent draw (which in fact it won’t be), you get at most one extra bit of optimization towards alignment.
- Chain of thought prompting can be thought of as creating an average speed bias that might disincentivize deception.
- evhub 3 Aug 2022 3:40 UTC
  LW: 9 AF: 7
  - A deceptive model doesn’t need a very explicit check for whether it’s in training or deployment, any more than a factory-cleaning robot needs a very explicit check for whether it’s in the jungle instead of a factory. If it someday found itself in a situation very different from its current one (training), it would reconsider its actions, but it doesn’t think about that possibility very often because, during training, it just looks too unlikely.