Sam’s theory gives some evidence for broad commonality between the data-structures different minds would use to represent the same problem (because it proves this happens under some conditions).
Sam’s theory gives a framework for proving useful things about these structures.
Sam’s theory of interpretability comes from a kind of usefulness, so there’s at least some hope here.
How strong is that evidence, for you personally? Is it constructive (gives you at least vague ideas of how problems 1-3 could be solved without ignoring computational cost) or not (just gives you abstract hope)? I’d be very interested to read your thoughts, since your post doesn’t emphasize your own opinions.
My personal impression of mathematical theories for alignment is very pessimistic. The pattern, as I see it, goes like this: someone proposes a mathematical framework that is too general to make progress on. The framework struggles to prove even the most general theorems without relying on questionable assumptions. We don’t know whether the proved theorems even mean what we think they mean (because the questionable assumptions narrow the scope of the theorems in strange ways, making them very easy to misinterpret). Finally, the math has nothing to say about something super important (such as computational cost). Disclaimer: this impression is very indirect (I’m bad at math), and it might be unreasonable to ask for anything better.
How unique/unexpected is Sam’s theory, compared to other mathematical theories for alignment (from your subjective perspective)?