I’ve only just realised that a key part of the AI alignment problem is essentially Wittgenstein’s rule-following argument. (Maybe obvious, but I’ve never seen this stated before.)
His rule-following argument claims that it’s impossible to define a term unambiguously, whether by examples, by rules, or in terms of other words; indeed, any definition is so ambiguous as to be consistent with any future application of the term. So you can’t even teach someone ‘+’ in such a way that, when following your definition/rule/algorithm, they will give your intended answer to a sum they haven’t seen before, eg 1000 + 1000 = 2000. They could just as ‘correctly’ answer 3000 or −45.7 or pi. (I won’t explain why here.)
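To make the ‘+’ point concrete, here is a minimal sketch in the spirit of Kripke’s well-known ‘quus’ reading of the argument. The threshold, the function names, and the deviant answer 3000 are my illustrative assumptions, adapted to the 1000 + 1000 example above; nothing here is from Wittgenstein himself.

```python
# Sketch of the rule-following worry, loosely after Kripke's "quus".
# Both functions agree on every sum the learner has ever seen
# (assumed here to involve operands below THRESHOLD), yet diverge
# on new inputs.

THRESHOLD = 1000  # hypothetical bound on all previously seen examples

def plus(x: int, y: int) -> int:
    """Ordinary addition."""
    return x + y

def quus(x: int, y: int) -> int:
    """Agrees with plus on all 'seen' inputs, diverges beyond them."""
    if x < THRESHOLD and y < THRESHOLD:
        return x + y
    return 3000  # any deviant answer is equally 'consistent' with past use

# Any finite set of examples underdetermines the rule:
for a, b in [(2, 3), (57, 68), (999, 1)]:
    assert plus(a, b) == quus(a, b)  # indistinguishable on past cases

print(plus(1000, 1000))  # 2000
print(quus(1000, 1000))  # 3000
```

The past examples can’t settle which function the learner ‘really’ grasped, and infinitely many such deviant functions fit any finite history of use.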
Compare alignment: no amount of training an AI to be ‘good’ etc will guarantee that it remains so in novel situations.
I’m not convinced Wittgenstein was right (I argued against the rule-following argument in my philosophy master’s, FWIW); maybe a real philosopher, more familiar with the topic, could apply it usefully to AI alignment.