Has it been rolled out yet? I would really like this feature.
RE spamming: certainly they can be disabled by default, and you can have an unsubscribe button at the bottom of every email?
I view this as a capability control technique, highly analogous to running a supervised learning algorithm where a reinforcement learning algorithm is expected to perform better. Intuitively, it seems like there should be a spectrum of options between (e.g.) supervised learning and reinforcement learning that would allow one to make more fine-grained safety-performance trade-offs.
I’m very optimistic about this approach of doing “capability control” by making less agent-y AI systems. If done properly, I think it could allow us to build systems that have no instrumental incentives to create subagents (although we’d still need to worry about “accidental” creation of subagents and (e.g. evolutionary) optimization pressures for their creation).
I would like to see this fleshed out as much as possible. This idea is somewhat intuitive, but it’s hard to tell if it is coherent, or how to formalize it.
P.S. Is this the same as “platonic goals”? Could you include references to previous thought on the topic?
I realized it’s unclear to me what “trying” means here, and in your definition of intentional alignment. I get the sense that you mean something much weaker than MIRI does by “(actually) trying”, and/or that you think this is a lot easier to accomplish than they do. Can you help clarify?
It seems like you are referring to daemons.
To the extent that daemons result from an AI actually doing a good job of optimizing the right reward function, I think we should just accept that as the best possible outcome.
To the extent that daemons result from an AI doing a bad job of optimizing the right reward function, that can be viewed as a problem with capabilities, not alignment. That doesn’t mean we should ignore such problems; it’s just out of scope.
Indeed, most people at MIRI seem to think that most of the difficulty of alignment is getting from “has X as explicit terminal goal” to “is actually trying to achieve X.”
That seems like the wrong way of phrasing it to me. I would put it like “MIRI wants to figure out how to build properly ‘consequentialist’ agents, a capability they view us as currently lacking”.
Can you please explain the distinction more succinctly, and say how it is related?
I don’t think I was very clear; let me try to explain.
I mean different things by “intentions” and “terminal values” (and I think you do too?)
By “terminal values” I’m thinking of something like a reward function. If we literally just program an AI to have a particular reward function, then we know that its terminal values are whatever that reward function expresses.
Whereas “trying to do what H wants it to do” I think encompasses a broader range of things, such as when R has uncertainty about the reward function, but “wants to learn the right one”, or really just any case where R could reasonably be described as “trying to do what H wants it to do”.
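To make the contrast concrete, here is a minimal sketch, assuming one particular reading of “has uncertainty about the reward function, but wants to learn the right one”: R keeps a belief over candidate reward functions and updates it on feedback from H. The names (`act_on_fixed_reward`, `update_belief`, the tea/coffee toy task) and the likelihood numbers are all hypothetical, and this is only one of the many cases that “trying to do what H wants” is meant to cover.

```python
from typing import Callable, Dict, List

State = str
RewardFn = Callable[[State], float]


def act_on_fixed_reward(states: List[State], reward: RewardFn) -> State:
    """Agent whose terminal values are exactly one programmed reward function."""
    return max(states, key=reward)


def update_belief(belief: Dict[str, float],
                  likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes update of R's belief over candidate reward functions."""
    unnormalized = {name: belief[name] * likelihood[name] for name in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("feedback has zero probability under every hypothesis")
    return {name: weight / total for name, weight in unnormalized.items()}


def act_under_reward_uncertainty(states: List[State],
                                 candidates: Dict[str, RewardFn],
                                 belief: Dict[str, float]) -> State:
    """Agent that maximizes expected reward under its current belief."""
    def expected_reward(state: State) -> float:
        return sum(belief[name] * candidates[name](state) for name in candidates)
    return max(states, key=expected_reward)


# Hypothetical toy task: H wants a drink, and R does not know which one.
states = ["make_tea", "make_coffee"]
candidates = {
    "h_likes_tea": lambda s: 1.0 if s == "make_tea" else 0.0,
    "h_likes_coffee": lambda s: 1.0 if s == "make_coffee" else 0.0,
}

# Case 1: the reward function is hard-coded, so the terminal values are known.
fixed_choice = act_on_fixed_reward(states, candidates["h_likes_tea"])

# Case 2: R starts uncertain and updates on (assumed) feedback from H.
prior = {"h_likes_tea": 0.5, "h_likes_coffee": 0.5}
feedback_likelihood = {"h_likes_tea": 0.2, "h_likes_coffee": 0.8}  # made-up numbers
posterior = update_belief(prior, feedback_likelihood)
uncertain_choice = act_under_reward_uncertainty(states, candidates, posterior)
```

In the first case the behavior is pinned down by the programmed reward. In the second, what R does depends on what it has learned about H, even though no single reward function was ever written in as its terminal value.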
Talking about a “black box system” was probably a red herring.
Another way of putting it: A parochially aligned AI (for task T) needs to understand task T, but doesn’t need to have common sense “background values” like “don’t kill anyone”.
Narrow AIs might require parochial alignment techniques in order to learn to perform tasks that we don’t know how to write a good reward function for. And we might try to combine parochial alignment with capability control in order to get something like a genie without having to teach it background values. When/whether that would be a good idea is unclear ATM.
It doesn’t *necessarily*. But it sounds like what you’re thinking of here is some form of “sufficient alignment”.
The point is that you could give an AI a reward function that leads it to be a good personal assistant program, so long as it remains restricted to doing the sort of things we expect a personal assistant program to do, and isn’t doing things like manipulating the stock market when you ask it to invest some money for you (unless that’s what you expect from a personal assistant). If it knows it could do things like that, but doesn’t want to, then it’s more like something sufficiently aligned. If it doesn’t do such things because it doesn’t realize they are possibilities (yet), or because it hasn’t figured out a good way to use its actuators to have that kind of effect (yet), because you’ve done a good job boxing it, then it’s more like “parochially aligned”.
This is one of my main cruxes. I have 2 main concerns about honest mistakes:
1) Compounding errors: IIUC, Paul thinks we can find a basin of attraction for alignment (or at least corrigibility...) so that an AI can help us correct it online to avoid compounding errors. This seems plausible, but I don’t see any strong reasons to believe it will happen or that we’ll be able to recognize whether it is or not.
2) The “progeny alignment problem” (PAP): An honest mistake could result in the creation of an unaligned progeny. I think we should expect that to happen quickly if we don’t have a good reason to believe it won’t. You could argue that humans recognize this problem, so an AGI should as well (and if it’s aligned, it should handle the situation appropriately), but that begs the question of how we got an aligned AGI in the first place. There are basically 3 subconcerns here (call the AI we’re building “R”):
2a) R can make an unaligned progeny before it’s “smart enough” to realize it needs to exercise care to avoid doing so.
2b) R gets smart enough to realize that solving PAP (e.g. doing something like MIRI’s AF) is necessary in order to develop further capabilities safely, and that ends up being a huge roadblock that makes R uncompetitive with less safe approaches.
2c) If R has gamma < 1, it could knowingly, rationally decide to build a progeny that is useful through R’s effective horizon, but will take over and optimize a different objective after that.
2b and 2c are *arguably* “non-problems” (although they’re at least worth taking into consideration). 2a seems like a more serious problem that needs to be addressed.
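To illustrate 2c, here is a minimal sketch, assuming R discounts exponentially with factor gamma and using the common rule of thumb 1 / (1 - gamma) for its “effective horizon”. The function names and the payoff stream (a progeny that is worth +1 per step until the horizon and -1 per step after it takes over) are made-up assumptions for illustration only.

```python
from typing import Iterable


def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb horizon 1 / (1 - gamma) for exponential discounting."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return 1.0 / (1.0 - gamma)


def weight_after(gamma: float, t: int) -> float:
    """Fraction of total discounted weight that falls on steps t, t+1, ...

    sum_{k >= t} gamma^k divided by sum_{k >= 0} gamma^k equals gamma^t.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    return gamma ** t


def discounted_return(rewards: Iterable[float], gamma: float) -> float:
    """Sum of gamma^k * r_k over a finite reward stream."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))


gamma = 0.99
takeover_step = round(effective_horizon(gamma))  # progeny defects at the horizon

# Hypothetical payoffs: useful (+1) before takeover, harmful (-1) afterwards.
# The stream is truncated at 2000 steps; the remaining tail is negligible here.
progeny_rewards = [1.0 if k < takeover_step else -1.0 for k in range(2000)]

value_of_progeny = discounted_return(progeny_rewards, gamma)
value_of_no_progeny = 0.0  # assumed baseline: R builds nothing

# Under these made-up payoffs, R prefers to build the progeny.
r_builds_progeny = value_of_progeny > value_of_no_progeny
```

With these numbers, only about a third of R’s total discounted weight falls after the takeover, so the early benefits dominate. Whether that holds in general depends on how bad the post-takeover outcome is relative to the pre-takeover gains.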
This is not what I meant by “the same values”, but the comment points towards an interesting point.
When I say “the same values”, I mean the same utility function, as a function over the state of the world (and the states of “R is having sex” and “H is having sex” are different).
The interesting point is that states need to be inferred from observations, and it seems like there are some fundamentally hard issues around doing that in a satisfying way.
So my original response was to the statement:
Differential research that advances safety more than AI capability still advances AI capability.
Which seems to suggest that advancing AI capability is sufficient reason to avoid technical safety work that has non-trivial overlap with capabilities. I think that’s wrong.
RE the necessary and sufficient argument:
1) Necessary: it’s unclear that a technical solution to alignment would be sufficient, since our current social institutions are not designed for superintelligent actors, and we might not develop effective new ones quickly enough.
2) Sufficient: I agree that never building AGI is a potential Xrisk (or close enough). I don’t think it’s entirely unrealistic “to shoot for levels of coordination like ‘let’s just never build AGI’”, although I agree it’s a long shot. Supposing we have that level of coordination, we could use “never build AGI” as a backup plan while we work to solve technical safety to our satisfaction, if that is in fact possible.
Moving on from that I’m thinking that we might need a broad base of support from people (depending upon the scenario) so being able to explain how people could still have meaningful lives post AI is important for building that support. So I’ve been thinking about that.
This sounds like it would be useful for getting people to support the development of AGI, rather than effective global regulation of AGI. What am I missing?
Can you give some arguments for these views?
I think the best argument against institution-oriented work is that it might be harder to make a big impact. But more importantly, I think strong global coordination is necessary and sufficient, whereas technical safety is plausibly neither.
I also agree that one should consider tradeoffs, sometimes. But every time someone has raised this concern to me (I think it’s been 3x?) I think it’s been a clear cut case of “why are you even worrying about that”, which leads me to believe that there are a lot of people who are overconcerned about this.
It seems like the preferences of the AI you build are way more important than its experience (not sure if that’s what you mean).
This is because the AI’s preferences are going to have a much larger downstream impact?
I’d agree, but caveat that there may be likely possible futures which don’t involve the creation of hyper-rational AIs with well-defined preferences, but rather artificial life with messy, incomplete, inconsistent preferences but morally valuable experiences. More generally, the future of the light cone could be determined by societal/evolutionary factors rather than any particular agent or agent-y process.
I found your 2nd paragraph unclear...
the goals happen to overlap enough
Is this referring to the goals of having “AIs that have good preferences” and “AIs that have lots of morally valuable experience”?
Are you funding constrained? Would you give out more money if you had more?
FWIW, I think I represent the majority of safety researchers in saying that you shouldn’t be too concerned with your effect on capabilities; there are many more people pushing capabilities, so most safety research is likely a drop in the capabilities bucket (although there may be important exceptions!)
Personally, I agree that improving social institutions seems more important for reducing AI-Xrisk ATM than technical work. Are you doing that? There are options for that kind of work as well, e.g. at FHI.
Overall, I think the question “which AIs are good successors?” is both neglected and time-sensitive, and is my best guess for the highest impact question in moral philosophy right now.
Interesting… my model of Paul didn’t assign any work in moral philosophy high priority.
I agree this is high impact. My idea of the kind of work to do here is mostly trying to solve the hardish problem of consciousness so that we can have some more informed guess as to the quantity and valence of experience that different possible futures generate.
I don’t think most places have enough ML courses at the undergraduate level; I’d expect 0-2 undergraduate ML courses at a typical large or technically focused university. OFC, you can often take graduate courses as an undergraduate as well.
There are lots of graduate ML programs that will give you ML background (although that might not be the most efficient route; e.g. compare with Google Brain Residency).
Is there a clear academic path towards getting a good background for AF? Maybe mathematical logic? RAISE might be filling that niche?