No77e comments on No77e’s Shortform

No77e 26 Feb 2023 18:17 UTC
Trying to write a reward function, or a loss function, that captures human values seems hopeless.

But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that’s less hopeless.

It's the familiar difference between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on...
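
A toy illustration of the proving/checking gap, using subset sum rather than anything about values or interpretability. This is only a sketch: the names `check` and `construct` are made up for the example, and the brute-force search stands in for "constructing" in general. Checking a proposed answer takes one pass over it, while constructing one by brute force can require trying up to 2^n subsets.

```python
from itertools import combinations


def check(numbers, target, candidate):
    """Recognize a solution: the candidate indices are distinct,
    in range, and the selected numbers sum to the target."""
    return (
        len(set(candidate)) == len(candidate)
        and all(0 <= i < len(numbers) for i in candidate)
        and sum(numbers[i] for i in candidate) == target
    )


def construct(numbers, target):
    """Construct a solution by brute force: try every subset of
    indices, smallest subsets first, and return the first that checks."""
    for size in range(len(numbers) + 1):
        for candidate in combinations(range(len(numbers)), size):
            if check(numbers, target, candidate):
                return candidate
    return None  # no subset sums to the target


numbers = [3, 34, 4, 12, 5, 2]
solution = construct(numbers, 9)
print(solution, check(numbers, 9, solution))
```

The analogy is loose: here the checker is trivial to write down, whereas for human values the hope is only that recognizing them inside a model is easier than specifying them from scratch.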