This paper uses famous problems from philosophy of science and philosophical psychology—underdetermination of theory by evidence, Nelson Goodman’s new riddle of induction, theory-ladenness of observation, and “Kripkenstein’s” rule-following paradox—to show that it is empirically impossible to reliably interpret which functions a large language model (LLM) AI has learned, and thus, that reliably aligning LLM behavior with human values is provably impossible.
So this seems, provisionally, to be bullshit, because it makes no allowance for probabilistic reasoning or simplicity priors (a toy sketch of what I mean is below). But I'm not totally sure it's worthless. Anyone read it in detail?
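To make the simplicity-prior point concrete, here is a toy sketch of my own (not anything from the paper): take Goodman's "green" vs. "grue" hypotheses, which fit the observed emeralds equally well, and separate them with an Occam-style prior under a standard Bayesian update. The description lengths below are made up purely for illustration.

```python
# Toy sketch (mine, not the paper's): two hypotheses that fit the same evidence
# -- "all emeralds are green" vs. Goodman's "grue" variant -- can still be
# ranked if we put a simplicity prior over them and update probabilistically.

def simplicity_prior(description_length_bits):
    """Crude Occam-style prior: P(h) proportional to 2^(-description length)."""
    return 2.0 ** (-description_length_bits)

# Hypothetical description lengths: "grue" needs an extra clause (a time/index
# parameter), so its encoding is longer. The exact numbers are illustrative.
hypotheses = {
    "green": {"prior": simplicity_prior(10), "likelihood_per_obs": 1.0},
    "grue":  {"prior": simplicity_prior(14), "likelihood_per_obs": 1.0},
}

def posterior(hypotheses, n_observations):
    """Bayesian update: both hypotheses predict every observed green emerald
    equally well, so the likelihoods are identical and only the prior
    separates them."""
    unnorm = {
        name: h["prior"] * (h["likelihood_per_obs"] ** n_observations)
        for name, h in hypotheses.items()
    }
    z = sum(unnorm.values())
    return {name: p / z for name, p in unnorm.items()}

print(posterior(hypotheses, n_observations=100))
# -> {'green': ~0.94, 'grue': ~0.06}: the choice is underdetermined by the
#    evidence alone, but not by evidence plus a simplicity prior.
```

The same move applies to interpreting which function a model has learned: "the evidence doesn't uniquely pin down the hypothesis" only blocks interpretation if you refuse to weight hypotheses at all.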
The abstract also misses the possibility of constructing an LLM and/or a training methodology such that it will learn certain functions, or can't learn certain ones. On top of that, it conflates "reliable" with "provable."
Perhaps the body of the paper addresses these objections somewhere, but I'm not going to go looking: the abstract smells enough like bullshit that I'd rather do something else.
https://philpapers.org/rec/ARVIAA