Awesome post! I broadly agree with most of the points and think hodge-podging would be a fairly valuable agenda to further pursue. Some thoughts:
> What could AI alignment look like if we had 6000+ full-time researchers and software developers?
My immediate impression is that, if true, this makes hodge-podging fairly well suited for automation (compared to conceptual/theoretical work, for the reasons laid out here).
> But when we assemble the various methods, suddenly that works great because there’s a weird synergy between the different methods.
I agree that most synergies would be positive, but the way this post puts it seems to imply that they would be somewhat unexpected. Isn’t the whole point of having primitives & taxonomizing type signatures to ensure that their compositions behave predictably and robustly?
Perhaps my uncertainty is about what level of “formalization” hodge-podging would be aiming for. If it’s aiming for a fully mathematically formal characterization of various safety properties (e.g. PreDCA-style), then sure, it would permit lossless, provable guarantees about the properties of its compositions, as is the case with cryptographic primitives (there are no unexpected synergies from assembling them).
But if the methods are at the ELK/plausible-training-stories level of formalization, I suspect hodge-podging would be less able to make composition guarantees, as the “emergent” features from composing them start to come into the picture. At that point, how can it guarantee that there aren’t any negative synergies the misaligned AI could exploit?
(I might just be totally confused here given that I know approximately nothing about categorical systems theory)
For the next step, I might post a distillation of David Jaz Myers’s Categorical Systems Theory, which treats dynamical systems and their typed wirings as polymorphic lenses.
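To give a concrete flavor of that lens framing, here is a minimal sketch (my own illustrative names, not Myers’s notation, and ignoring the polynomial-functor machinery): a discrete-time Moore machine has the lens-like shape of a readout map `S -> B` and an update map `S x A -> S`, and a series wiring of two machines is again a Moore machine, which is the compositionality property being discussed above.

```python
from dataclasses import dataclass
from typing import Callable

# A discrete-time Moore machine with internal state S,
# inputs A, outputs B. The (readout, update) pair is the
# lens-like shape; names here are illustrative.
@dataclass
class Moore:
    state: object
    readout: Callable  # S -> B
    update: Callable   # (S, A) -> S

def step(m: Moore, a):
    """Advance one tick: emit the current output, then update state."""
    b = m.readout(m.state)
    m.state = m.update(m.state, a)
    return b

def wire_series(m1: Moore, m2: Moore) -> Moore:
    """Series wiring: m1's output feeds m2's input.
    The composite is itself a Moore machine on the product state."""
    def readout(s):
        _, s2 = s
        return m2.readout(s2)
    def update(s, a):
        s1, s2 = s
        b1 = m1.readout(s1)  # signal passed along the wire
        return (m1.update(s1, a), m2.update(s2, b1))
    return Moore((m1.state, m2.state), readout, update)

# Example: a running counter wired into a doubler.
counter = Moore(0, lambda s: s, lambda s, a: s + a)
doubler = Moore(0, lambda s: s, lambda s, b: 2 * b)
sys = wire_series(counter, doubler)
outs = [step(sys, 1) for _ in range(4)]
```

The point of the typed-wiring discipline is exactly the composition question above: because the composite is again an object of the same kind, properties stated at the interface level have a chance of being checked for the whole assembly rather than rediscovered per combination.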
Please do!