Why do you say this would be the easiest type of AGI to align? This alignment goal doesn't seem particularly simpler than any other. Maybe a bit simpler than "do something all of humanity will like," but more complex than, say, "follow instructions from this one person in the way they intended them."
From a software engineering perspective, misalignment is like a defect or a bug in software. Generally speaking, a piece of software that doesn't accept any user input is going to have fewer bugs than software that does. For software that accepts no input, or only constrained user input, it's possible to formally prove that the logic is correct; think of specialized software that controls nuclear power plants. To my knowledge, it's not possible to prove that software that accepts arbitrary, unconstrained instructions from a user is defect-free.
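To illustrate the constrained-input point with a toy sketch (my own example, not anything from the Observer proposal or from formal-verification practice): a controller whose input space is a small, finite set can be checked exhaustively over every possible input, while one that accepts arbitrary free-form instructions has an unbounded input space and can't be checked that way. The names `constrained_controller` and `unconstrained_controller` are hypothetical.

```python
# Illustrative sketch only: finite input space vs. unbounded input space.
from typing import Literal

Command = Literal["raise_rods", "lower_rods", "hold"]
VALID_COMMANDS: tuple[Command, ...] = ("raise_rods", "lower_rods", "hold")

def constrained_controller(command: Command) -> str:
    """Only three inputs are possible, so every behavior can be enumerated and checked."""
    if command == "raise_rods":
        return "actuating: raise control rods"
    if command == "lower_rods":
        return "actuating: lower control rods"
    return "holding position"

def unconstrained_controller(instruction: str) -> str:
    """Accepts any string; the input space is unbounded, so exhaustive checking
    is impossible and proving correctness is far harder."""
    return f"interpreting free-form instruction: {instruction!r}"

# "Verifying" the constrained controller reduces to a finite check over all inputs:
for cmd in VALID_COMMANDS:
    assert constrained_controller(cmd).startswith(("actuating", "holding"))
```

The point of the contrast is only that the first function's entire behavior is coverable by a finite check, while no analogous finite check exists for the second.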
I claim that the Observer is the easiest ASI to align because it doesn't accept any instructions after it's been deployed, and it has a single, very simple goal that avoids dealing with messy things like human happiness, human meaning, human intent, etc. I don't see how it could get simpler than that.
I just don’t think the analogy to software bugs and user input goes very far. There’s a lot more going on in alignment theory.
It seems like "seeing the story out to the end" involves all sorts of vague, hard-to-define things, very much like "human happiness" and "human intent".
It’s super easy to define a variety of alignment goals; the problem is that we wouldn’t like the result of most of them.
Fair enough; you have a lot more experience, and I could be totally wrong on this point.
At this point, if I'm going to do anything, it should probably be getting hands-on and actually trying to build an aligned system with RLHF or some other method.
Thank you for engaging on this and my previous posts, Seth!