There is a paradigm missing from the discussion of existential risk from ASI misalignment. The threat posed by ASI generalizes to the concept of "Outcome Influencing Systems" (OISs). My hope is that developing terminology and formalism around this model may mitigate the problems with existing terminology and support more productive discourse and interdisciplinary research on ASI risk and social coordination problems.
I am currently the only contributor. I think the idea has merit, but I am still at the point of seeking either collaborators who can help spread the idea, or people who can point out enough flaws that it is worth abandoning.
Question about the OIS research direction: What sort of principles might you hope to learn about outcome influencing systems which could help with the problems of ASI?
It seems to me that the problem isn't OISs in general. We have plenty of safe/aligned OISs, including some examples from your OIS doc such as thermostats. Even a lot of non-AI software systems seem like safe/aligned OISs. The problem seems to be with this specific type of OIS (powerful deep-learning based models), which is both much more capable and much harder to ensure safe behavior of than software with legible code. (I read your doc quickly, so I may have missed an answer to this which you already provide in there.)
--
Also, thanks for putting forth a novel research direction for addressing AI risk. I think we should have a high bar for which research directions we put scarce resources into, but at the same time it’s super valuable to propose new research directions. What we are looking for could just be low-hanging fruit in the unexplored idea space, overlooked by all the people who are working on more established stuff.
Thank you for engaging : )
As per your note about directions and scarce resources, I agree. I hope the OIS agenda is not a waste of time, and if it is, I hope you can help me identify that quickly! Scout mindset.
In trying to respond to your question I wrote quite a bit. Feel free to skim it and see if any of it hits the mark of what you were trying to get at. Apologies if it feels like a repeat of things I already said in my doc.
First, not a principle, but an example of the OIS lens in use: you identified "powerful deep-learning based models" as the dangerous type of OIS, and I think they are dangerous, but in the same way that C4 is dangerous. It isn't dangerous on its own, but becomes so once it is used as a component or tool in the creation of a new OIS. Is this an obvious or trivial thing to point out? It might be, but I think it is good to have language that makes clear what we are talking about. More on this below.
I'll describe three ways that pursuing OIS theory might be helpful:
Characterizing and Classifying OISs:
As you noted, many OISs are safe. (As an aside, I might disagree that they are aligned; rather, their alignment is "sufficient" WRT their capabilities.) But the safety of other OISs is much less clear, such as market dynamics and cult-like social movements, and there are examples of OISs that are strictly harmful, such as disease and addiction.
I think the fact that there are many kinds of OISs, which we understand at very different levels of detail, is a great opportunity for learning from examples. If we can create a formalism for OISs and classify the similarities and differences between the kinds of OISs that appear in different fields of study, we may be able to characterize the properties of different OISs and where those properties do and do not apply. That would give us a key for translating between the formalisms of the many fields that study particular kinds of OISs, which could hopefully be leveraged into understanding progressively more complicated and intractable kinds of OIS, with the goal of eventually understanding ASI well enough to build it safely.
I don't believe I have made much progress on this characterization work yet. I am more in the process of identifying important classes of properties, such as the substrate an OIS is composed of, or its interconnection with other OISs. I hope this work will lead to classifications of OISs about which we can make special statements that generalize to all OISs in each class.
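To make the flavour of this concrete, here is a minimal sketch of how such a classification might be encoded. The property axes, categories, and example profiles below are my own illustrative assumptions, not settled parts of the theory:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Substrate(Enum):
    """Illustrative substrate categories an OIS might be built from."""
    MECHANICAL = auto()      # e.g. a thermostat
    BIOLOGICAL = auto()      # e.g. a person, a disease
    SOCIAL = auto()          # e.g. a company, a market, a social movement
    COMPUTATIONAL = auto()   # e.g. a deep-learning model


@dataclass
class OISProfile:
    """A coarse profile of an OIS along a few candidate property axes."""
    name: str
    substrates: set
    preference_encoding: str          # where/how its preferences live
    legible_preferences: bool         # can the encoding be read off directly?
    interconnected_with: list = field(default_factory=list)


# Example profiles spanning systems we understand to very different degrees.
thermostat = OISProfile(
    name="thermostat",
    substrates={Substrate.MECHANICAL},
    preference_encoding="setpoint knob position",
    legible_preferences=True,
)
market = OISProfile(
    name="market",
    substrates={Substrate.SOCIAL},
    preference_encoding="distributed across prices and incentives",
    legible_preferences=False,
    interconnected_with=["companies", "people", "governments"],
)

# Grouping by a shared property is how we would look for statements that
# generalize to every OIS in a class, e.g. all OISs with legible preferences.
legible = [p.name for p in (thermostat, market) if p.legible_preferences]
print(legible)  # ['thermostat']
```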
Exploring and Explaining the Strategic Situation:
I think spreading and using the OIS lens might help people consider and communicate about AI risk with more clarity. Here are three aspects of the OIS lens that might make it helpful for the strategic situation surrounding AI risk:
i. Dense Interconnection:
In exploring OISs it becomes clear that OISs intersect one another and are composed of one another in complicated ways. I won't go into too much detail here, but note that a person is an OIS and may act (among other roles) as an employee on a team, which is an OIS, within a company, which is yet another OIS.
This quickly becomes quite complicated in a system-dynamics kind of way, but I think it is important to understand, even if we cannot identify all OISs or model their dynamic interplay.
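As a toy illustration (every name and relationship below is invented), the containment structure can be treated as a simple graph in which one OIS can sit inside several others at once:

```python
# A toy containment graph: each OIS lists the OISs it is directly a
# component of. One OIS can be part of several others at the same time.
part_of = {
    "alice (person)": ["safety team", "chess club"],
    "safety team": ["acme corp"],
    "chess club": [],
    "acme corp": ["national economy"],
}


def enclosing_oiss(name: str) -> set:
    """Return every OIS that transitively contains the given OIS."""
    found = set()
    frontier = [name]
    while frontier:
        for parent in part_of.get(frontier.pop(), []):
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found


print(sorted(enclosing_oiss("alice (person)")))
# ['acme corp', 'chess club', 'national economy', 'safety team']
```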
ii. Preference Independence:
Even though two OISs may be interconnected, say, an employee and their organization, that does not imply any relationship between the preferences of the two OISs. The organization may be working to put the employee out of a job even as the employee contributes to that effort. In this situation the employee may be acting to obtain money as an instrumental goal and find themselves forced to act against their own long-term interests.
Another example I like is the "dysfunctional romance". Each lover's preferences would be better served if they ended the relationship, but the relationship itself is an OIS that seems to prefer that the two people suffer.
The bottom line is that people building or contributing to an OIS does not immediately imply that the OIS is capable or sufficiently aligned WRT its capabilities. That needs to be shown by identifying the preference encoding and how it is acted upon by the system, or, more weakly, it can be inferred empirically.
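Here is a deliberately crude toy model of the employee example, with made-up numbers, just to show that a component's utility and the composite OIS's objective can come apart:

```python
# A crude toy model of the employee/organization example (numbers made up).
# The employee optimizes short-term income by contributing to an automation
# project; the organization's objective is the automation itself, which,
# once complete, eliminates the role the employee depends on.

def employee_utility(months_worked: int, job_survives: bool) -> float:
    salary = 1.0 * months_worked                   # pay accumulated so far
    future_income = 24.0 if job_survives else 0.0  # value of keeping the job
    return salary + future_income


def org_objective(automation_progress: float) -> float:
    return automation_progress  # the org "prefers" full automation


for months in (0, 6, 12):
    progress = min(1.0, months / 12)
    survives = progress < 1.0
    print(months, employee_utility(months, survives), org_objective(progress))
# At 12 months the org's objective reaches 1.0 exactly as the employee's
# utility drops (from 30.0 at 6 months to 12.0 here).
```

Under these made-up numbers, the organization's objective is fully satisfied at exactly the point where the employee's long-term utility collapses, even though the employee's own contributions drove the progress.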
iii. Simple, Formal, Well Defined:
I think there is some degree to which the disorganization of AI terminology harms our ability to coordinate and to inform stakeholders about the situation. I feel that if a sufficiently simple core model of an OIS could be defined, with examples, it would allow better communication. Confusion from people having different, fuzzy ideas of what terms like "intelligence" or "AGI" mean could be avoided. My hope is for OIS terminology to be mostly unrelated to current terminology, avoiding the ambiguity of further overloading existing terms, and to cut up the space of relevant concepts in a way that is more useful than the way existing words do.
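To illustrate how small such a core model might be, here is one candidate skeleton. The decomposition into a preference encoding plus an influence step is my own tentative framing, not an established definition:

```python
from typing import Any, Protocol


class OIS(Protocol):
    """Candidate minimal interface: something with an identifiable
    preference encoding that acts so as to push outcomes toward it."""

    def preference_encoding(self) -> Any:
        """The part of the system's state that encodes which outcomes it favours."""
        ...

    def influence(self, observation: Any) -> Any:
        """An action intended to move outcomes toward the encoded preferences."""
        ...
```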
Formalizing Preference Encodings and their Related Semantic Mappings:
In a thermostat, the preferences are encoded in the position of the temperature control knob. It was designed that way, with the preferences and capabilities clearly isolated from one another.
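For reference, a minimal sketch of the thermostat viewed this way, with the preference encoding (the setpoint) held in one clearly identifiable place and the capability (the control rule) kept separate; thresholds and interface are illustrative only:

```python
class Thermostat:
    def __init__(self, setpoint_c: float):
        self.setpoint_c = setpoint_c  # the entire preference encoding

    def control(self, room_temp_c: float) -> str:
        # Capability: push the measured outcome toward the encoded preference.
        if room_temp_c < self.setpoint_c - 0.5:
            return "heat_on"
        if room_temp_c > self.setpoint_c + 0.5:
            return "heat_off"
        return "hold"


t = Thermostat(setpoint_c=21.0)
print(t.control(18.0), t.control(23.0))  # heat_on heat_off
```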
With neural nets, humans, and human organizations, the situation is much more complicated. Is it hopeless to try to divine where and what the preferences are? I don't think so. First, the question can be approached empirically, approximately, and statistically; additionally, I think Mechanistic Interpretability (MI) and unsupervised ML may help us make sense of this previously unanswerable question. I am currently contextualizing neural networks based on the way I approach them in my MI work: thinking of them as mappings between semantic spaces.
(I expanded on the idea of semantic spaces in "Zoom Out: Distributions in Semantic Spaces".)
In the example of a cat-dog classifier, the input lives in image space, and we are specifically concerned with two distributions in that space: the distribution of images of cats and the distribution of images of dogs. The semantics of image space are the amounts of red, green, and blue in each pixel. This is a very useful semantics for how cameras and computer screens work, but it is very bad for determining whether a picture is of a cat or a dog. For this reason the network is trained to act as a semantic map from image space to a cat-dog likelihood space.
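A deliberately trivial stand-in makes the "map between semantic spaces" framing concrete. The linear map below is untrained and its shapes are arbitrary; training is what would shape the parameters so that the output coordinates actually carry the cat/dog semantics:

```python
import numpy as np

# An untrained stand-in for a cat-dog classifier, treated purely as a map
# between semantic spaces. Input space: 32x32x3 RGB images, whose native
# semantics are colour intensities. Output space: two coordinates whose
# intended semantics are "likelihood of cat" and "likelihood of dog".

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(32 * 32 * 3, 2))  # parameters of the map


def semantic_map(image: np.ndarray) -> np.ndarray:
    """Map a point in image space to a point in cat-dog likelihood space."""
    logits = image.reshape(-1) @ W
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax over (cat, dog)


image = rng.random((32, 32, 3))  # a random point in image space
print(semantic_map(image))       # two likelihoods summing to 1; uninformative before training
```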
I feel as though applying this concept through the OIS lens, especially in combination with unsupervised methods that create mappings to spaces with rich semantics, may help us get clearer on the separation between preferences and capabilities even in places where doing so seems impossible, such as in policy networks trained by RL, or even when looking for human preferences encoded in brain scans.