> However, I will be unable to make confident statements about how it’d perform on specific pivotal tasks until I complete further analysis.
Yeah, this part seems like a big potential downside, in combination with the “Good plans may also require precision.” problem.
Take, for example, the task of designing an aligned successor AGI. The code for this could end up being quite intricate and complex, leading to two failures: adding noise makes it very hard to reconstruct functional code (and slight mistakes could be catastrophic, since they could produce misaligned successors), and even if humans can properly reconstruct the code, they may not be able to understand it well enough to evaluate the safety of the plan. Maybe your proposal is aimed at higher-level plans than this? Or maybe solving AI alignment is simply too difficult a task to aim for on the first few tries, and your further analysis will reveal specific useful tasks which, done non-maliciously, are less brittle.
Overall, an interesting idea; I’m excited to see where the further analysis leads.