Still haven’t heard a better suggestion than CEV.
Yeah! That was the post that got me to really deeply believe the Orthogonality Thesis. “Naturalistic Awakening” and “Human Guide to Words” are my two favourite sequences.
The definition of OISs is actually slightly broader than that of optimization processes, though, for two reasons: (1) OISs have capabilities, not intelligence, and (2) OIS capabilities are arbitrarily general.
(1) The important distinction is that OISs are defined in terms of capabilities, not in terms of intelligence, where capabilities can be broken down into skills, knowledge, and resource access.
This is valuable for breaking skills down into skill domains, which is relevant for risk assessment, while intelligence is a kind of generalizable skill that seems to be very poorly defined and is usually more of a distraction from valuable analysis, in my opinion.
Also, resource access has the compounding property that knowledge and skill also have, which could potentially lead to dangerously compounding capabilities. Making it explicit that “intelligence” is not the only aspect of an OIS with this compounding property seems important. (There’s a toy sketch of this breakdown at the end of this comment.)
(2) is less well considered and less important. The example I have for this is a bottle cap. A bottle cap makes it more likely that water will stay in a bottle, but it isn’t an optimizer; it is an optimized object. When viewed through the optimizer lens, the bottle cap doesn’t want to keep the water in; rather, it was optimized by something that does want to keep the water in, so it is not itself an optimizer. That is, the cap has extremely fragile capabilities. It keeps the water in when it is screwed on, but if it is unscrewed it has no ability on its own to put itself back on or to keep trying to keep the water in. This must be very nearly the limit of how little capabilities can generalize.
However, from the OIS lens, the cap indeed makes water staying in the bottle a more likely outcome, and we can say that in some sense it does want to keep the water in.
I find it a little frustrating how general this makes the definition, and I’m sure other people will as well, but I think it is more useful in this case to cast a very wide net and then try to understand the differences between the kinds of things caught by that net, rather than working with overly limited definitions that fail to reference the objects I am interested in. It also highlights the potential issues with highly optimized, fragile OISs. If we need them to generalize, it is a problem that they won’t, and if we are expecting safety because something “isn’t actually an optimizer”, that may not matter if it is sufficiently well optimized over a sufficiently dangerous domain of capability.
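To make the breakdown from point (1) a bit more concrete, here is a toy sketch in Python. Everything in it (the class names, the dict representation, the reinvestment numbers) is my own hypothetical illustration rather than a formal part of the OIS framing; the point is just that capabilities decompose into skills, knowledge, and resource access, and that resource access shares the compounding property.

```python
# Toy sketch, purely illustrative: an OIS as preferences plus capabilities,
# with capabilities split into skills, knowledge, and resource access.
# compound() is a cartoon of the compounding property: resources can be
# reinvested into more skill, more knowledge, and more resources.
from dataclasses import dataclass, field


@dataclass
class Capabilities:
    skills: dict = field(default_factory=dict)      # skill domain -> proficiency
    knowledge: dict = field(default_factory=dict)   # topic -> how much is known
    resource_access: float = 0.0                    # e.g. money, compute, influence

    def compound(self, reinvestment_rate: float = 0.1) -> None:
        """Spend a fraction of resources to grow skills, knowledge, and resources."""
        gain = self.resource_access * reinvestment_rate
        self.resource_access += gain
        for domain in self.skills:
            self.skills[domain] += gain
        for topic in self.knowledge:
            self.knowledge[topic] += gain


@dataclass
class OIS:
    preferences: dict = field(default_factory=dict)  # outcome -> strength of preference
    capabilities: Capabilities = field(default_factory=Capabilities)
```

Note that “intelligence” doesn’t appear anywhere in the sketch, and the compounding loop runs through resource access just as much as through skill or knowledge, which is the property I want to keep explicit.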
TT Self Study Journal # 5
ToW: Response to “6 reasons why alignment-is-hard discourse...”. I liked this post. I’d like to write out some of my thoughts on it.
ToW: Exploration of simulacrum levels. It feels to me like the situation should be more of an interconnected web thing than discrete levels, but I haven’t thought it through enough yet.
ToW: “Does demanding rights make problem solvers feel insulted?”, an informal social commentary exploring my thoughts on the relationship between human rights and the systems we employ to ensure standards of human rights can be met.
ToW: Map articulating all talking (Maat) sequence. Current post plan is described in the first post of the sequence.
ToW: Outcome Influencing Systems (OISs) brief explainer, covering: high-level goals and theory of change; working definition; important properties.
ToW: “Shut up about consciousness”, a rant about how I think “consciousness” distracts from anything I want to talk about, probably including exploration of how OIS terminology is designed to avoid “consciousness” or other religious questions.
ToW: A review and explanation of the more mathematical AI topics (statistical ML theory, Universal AI, infra-Bayesianism, etc.) targeted at non-math audiences.
Tristan’s list of things to write
No problem. Hope my criticism didn’t come across as overly harsh. I’m grateful for your engagement : )
I think the reason humans care about other people’s interests, and aren’t power-seeking ruthless consequentialists, is because of evolution.
This is kind of a weird way to phrase it, since if I’m modelling the causal chain right:
(evolution)->(approval reward)->(not ruthless)
So yeah, evolution is causally upstream of not ruthless, but approval reward is the thing directly before it. Evolution caused all human behaviour, so if you observe any behaviour any human ever exhibits you can validly say “this is because of evolution”.
I imagine that you’re using a more specific definition of it than I am here.
I might be. I might also be using a more general definition. Or just a different one. Alas, that’s natural language for you.
very far into making AI systems do a lot of valuable work for us with very low risk.
I agree, but feel it’s important to note that the risk is only locally low. Globally, I think the risk is catastrophic.
I think the biggest difference in our POV might be that I think the systems we are using to control what happens in our world (markets, governments, laws) are already misaligned and heading towards disaster. If we allow them to keep getting more capable, they will not suddenly become capable enough to get back on track, because they were never aligned to target human-friendly preferences in the first place. Rather, they target proxies, and capabilities have gone beyond the point where those proxies are articulate enough to produce good outcomes. We need to switch focus from capabilities to alignment.
Approval Reward (AR) is a particular kind of corrigibility, so anything that isn’t corrigibility isn’t AR, and some things that are corrigibility still aren’t AR. The concept is bounded. Precise bounding doesn’t seem valuable while first exploring concepts. First come the fuzzy bounds, then the bounds can be shored up where it is important.
I agree with you that people are complex and AR probably doesn’t apply to all instances of someone saving to buy a car, but I’d be surprised if AR never applied to someone saving to buy a car.
The most important prediction of AR is that people who misgeneralize AR onto non-human entities will make mistakes in predicting those entities. People who approach wild animals are an example of this. I think anthropomorphizing machines is possible too; I bet people would be less safety-conscious around an industrial machine made to look friendly than one made to look scary. And most importantly (what I see as the key point of this post), we can predict that people who are optimistic about AGI are more likely to be reasoning based on AR, as opposed to reasoning based on systems dynamics and theory of agents.
My own internal gears-level model suggests that AR is only one component of what makes people erroneously optimistic. Other possibilities include people being too optimistic by default, people being incentivized to be optimistic, and an abundance of salient (but imprecise) reference classes with examples suggesting things will go well. In fact, AR could be seen as an important specific case of a reference class.
I think “alignment plasticity” is called “corrigibility”.
I agree with your view that approval reward as an AGI target would be complicated. I’d add the detail that even robustly desiring the approval of humans is probably not a good thing for an ASI to be doing, in the same way that a “smile optimizer” would not be a good thing for people who want smiles to come from genuine happiness.
unless we discover some magic universal value system that is relevant for all of humanity for all eternity
I’m not a huge fan of your dismissive tone here. My goal is to help humanity build a system for encoding such a thing. I think it is very difficult. Probably the most difficult thing humanity has ever attempted by far. But I do not think it is impossible, and it is only “magic” in the sense that any engineering discipline is magic.
My impression is that companies are very short-sighted, optimizing for quarterly and yearly results even if it has a negative effect on the company’s performance in 5 years, and even if it has negative effects on society. I also think many (most?) companies view regulations not as signals for how they should be behaving but more like board game rules: if they can change or evade the rules to profit, they will.
I’ll also point out that it is probably in the best interest of many customers to be pissed off. Sycophantic products make more money than ones that force people to confront the ways they are acting against their own values. In my estimation, that is a pretty big problem.
But the thing I’m most worried about is companies succeeding at “making solid services/products that work with high reliability” without actually solving the alignment problem. Then it becomes even more difficult to convince people there even is a problem, as they further insulate themselves from anyone who disagrees with their hyper-niche worldview.
You seem funny and smart! Haha! Self-aggrandizement can be charming but can also be grating and obnoxious… In my experience, describing self-aggrandizing desires I observe within myself typically goes poorly… My guess is it is too advanced a skill for a low-quality psychopath like myself.
Are you familiar with the bicameral mind? I kinda have a vibe that we didn’t really stop doing this; we just built rules around how we’re allowed to view it. Echoes and memories of speech from within oneself are still chaotic and difficult to understand, but we are taught that we are individuals and need to take responsibility for any actions caused by the voices we hear. We need to be able to explain our actions as part of a cohesive personal identity. But this is something we are culturally taught, not something innate to the working of the human mind. Once you realize this, you can work with identity and voice inside your mind however you want. But of course, break your sense of identity at your own risk!
You might be interested in a model / terminology / lens I’m trying to get off the ground: an Outcome Influencing System (OIS) is any system with preferences and capabilities that uses those capabilities to influence reality towards outcomes according to its preferences. An important aspect of this definition is that it includes not just AI and ASI, but also humans and human organizations. I think it’s useful because it lets us more easily talk about the risk of misaligned OISs, making it explicit that we are talking both about potential ASI and about the organizations that may create ASI, as well as organizations that may pose catastrophic risk due to their capabilities, regardless of whether they are using anything laypeople would identify as AI.
there’s a very large correlation between “not being scary” and “being commercially viable”, so I expect a lot of pressure for non-scary systems
I agree that there is an optimization pressure here, but I don’t think it robustly targets “don’t create misaligned superintelligence”; rather, it targets “customers and regulators not being scared”, which is very different from “don’t make things customers and regulators should be scared of”.
Thanks : )