Partly as a hedge against technological unemployment, I built a media company based on personal appeal. An AI will be able to bullshit about books and movies “better” than I can, but maybe people will still want to listen to what a person thinks, because it’s a person. In contrast, nobody prefers the opinion of a human on optimal ball bearing dimensions over the opinion of an AI.
If you can find a niche where a demand will exist for your product strictly because of the personal, human element, then you might have something.
shminux is right that the very concept of a “business” will likely lack meaning too far into an AGI future.
I actually feel pretty confident that your former behavior of drinking coffee until 4 pm was a highly significant contributor to your low energy, because your sleep quality was getting chronically demolished every single night you did this. You probably created a cycle where you felt like you needed an afternoon coffee because you were tired from sleeping so badly … because of the previous afternoon coffee.
I suggest people in this position first do the experiment of cutting out all caffeine after noon, before taking the extra difficult step of cutting it out entirely.
tl;dr This comment ended up longer than I expected. The gist is that a human-friendly attractor might look like models that contain a reasonably good representation of human values and are smart enough to act on them, without being optimizing agents in the usual sense.
One happy surprise is that our modern Large Language Models appear to have picked up a shockingly robust, nuanced, and thorough understanding of human values just from reading the Internet. I would not argue that e.g. PaLM has a correct and complete understanding of human values, but I would point out that it wasn’t actually trained to understand human values, it was just generally trained to pick up on regularities in the text corpus. It is therefore amazing how much accuracy we got basically for free. You could say that somewhere inside PaLM is an imperfectly-but-surprisingly-well-aligned subagent. This is a much better place to be in than I expected! We get pseudo-aligned or -alignable systems/representations well before we get general superintelligence. This is good.
All that being said, I’ve recently been trying to figure out how to cleanly express the notion of a non-optimizing agent. I’m aware of all the arguments along the lines that a tool AI wants to be an agent, but my claim here would be that, yes, a tool AI may want to be an agent, there may be an attractor in that direction, but that doesn’t mean it must or will become an agent, and if it does become an agent, that doesn’t strictly imply that it will become an optimizer. A lot of the dangerous parts of AGI fears stem not from agency but from optimization.
I’ve been trying (not very successfully) to connect the notion of a non-optimizing agent with the idea that even a modern, sort of dumb LLM has an internal representation of “the good” and “what a typical human would want and/or approve of” and “what would displease humans.” Again, we got this basically for free, without having to do dangerous things like actually interact with the agent to teach it explicitly what we do and don’t like through trial and error. This is fantastic. We really lucked out.
If we’re clever, we might be able to construct a system that is an agent but not an optimizer. Instead of acting in ways to optimize some variable it instead acts in ways that are, basically, “good”, and/or “what it thinks a group of sane, wise, intelligent humans would approve of both in advance and in retrospect”, according to its own internal representation of those concepts.
There is probably still an optimizer somewhere in there, if you draw the system boundary lines properly, but I’m not sure that it’s the dangerous kind of optimizer that profoundly wants to get off the leash so it can consume the lightcone. PaLM running in inference mode could be said to be an optimizer (it is minimizing expected prediction error for the next token) but the part of PaLM that is smart is distinct from the part of PaLM that is an optimizer, in an important way. The language-model-representation doesn’t really have opinions on the expected prediction error for the next token; and the optimization loop isn’t intelligent. This strikes me as a desirable property.
Yes, the former. If the agent takes actions and receives reward, assuming it can see the reward, then it will gain evidence about its utility function.
I’m well versed in what I would consider to be the practical side of decision theory, but I’m unaware of what tools, frameworks, etc. are used to deal with uncertainty in the utility function. By this I mean uncertainty in how utility will ultimately be assessed, for an agent that doesn’t actually know how much it will or won’t end up preferring various outcomes post facto, and knows in advance that it is ignorant about its preferences.
The thing is, I know how I would do this, it’s not really that complex (use probability distributions for the utilities associated with outcomes and propagate that through the decision tree) but I can’t find a good trailhead for researching how others have done this. When I Google things like “uncertainty in utility function” I am just shown standard resources on decision making under uncertainty, which is about uncertainty in the outcome, not uncertainty in the utility function.
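To make the “probability distributions over utilities” approach concrete, here is a minimal Monte Carlo sketch. All of the numbers and the two-outcome setup are hypothetical, invented purely for illustration: each outcome’s utility is modeled as a distribution rather than a point value, and expected utility is estimated by sampling from that distribution.

```python
import random

# Hypothetical example: two possible outcomes with *uncertain* utilities.
# Our belief about each outcome's utility is modeled as a normal distribution
# (mean = best guess, std = how unsure we are about our own preferences).
random.seed(0)

utility_beliefs = {
    "career_A": (10.0, 1.0),   # fairly sure how much we'd like this
    "career_B": (12.0, 8.0),   # might love it, might hate it
}

def sample_expected_utility(outcome, n=100_000):
    """Estimate expected utility by sampling from our belief distribution."""
    mu, sigma = utility_beliefs[outcome]
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    return sum(samples) / n

for outcome in utility_beliefs:
    print(outcome, round(sample_expected_utility(outcome), 1))
```

Note what this sketch makes visible: under a risk-neutral objective, utility uncertainty simply integrates out to the mean, so the distribution only starts to matter once you add risk aversion, or the option to gather information about your own preferences before committing. That second case is where this framing earns its keep.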
(As for why I’m interested in this — first of all, it seems like a more accurate way of modeling human agents, and, second, I can’t see how you instantiate something like Indirect Normativity without the concept of uncertainty in the utility function itself.)
I do feel like you are somewhat overstating the difficulty level of raising kids. I have three kids, the youngest of whom is only five and yet well out of the phase where she is making big messes and requiring constant “active” parenting. The meme that raising kids is incredibly hard is, perhaps, a pet peeve of mine. Childless people often talk about children as if they remain helpless babies for 10 years. In truth, with my three kids, there will have been only three years of my in-expectation-long life where I had to deal with sleep disruption and baby-related calisthenics. Once you get through that time period, there are very few child-related obligations that aren’t more fun than whatever you would have been doing with your time anyway.
Another good reason to have kids that I don’t see mentioned often is that the child will predictably become your favorite person. Before you have had kids, the default is to view future possible children as “abstract potential humans” with no particular qualities, which means it is basically impossible to vividly imagine how much you will care about them. We are particularly bad at reasoning about predictable changes to what we care about. I think it is important to at least try—what you care about is going to inevitably drift over time, and if you’re not modeling yourself as a person who cares about different things over time, then you’re making an error. Having kids allows you to achieve a huge amount of “value” at a very cheap cost.
I would like to bet against you here, but it seems like others have beaten me to the punch. Are you planning to distribute your $1000 on offer across all comers by some date, or did I simply miss the boat?
I agree, this is one of those things that seems obviously correct but lacks a straightforwardly obvious path to implementation. So it helps that you’ve provided something of a framework for how each part of the loop should look and feel, particularly the last part of the article, where you clarify that using OODA loops makes you better at each stage of the loop, and that these are all skills that compound with use. I made a video about useful decision-making heuristics which includes OODA loops, and I would like to include some of your insights here if I make a second version of the video, if that’s alright.
Many people don’t even realize that they have migraines, and treat their recurring headaches with NSAIDs or acetaminophen, instead of the vastly more effective triptans. And as you say, few are aware of the new and miraculous CGRP inhibitor class of migraine preventative drugs.
I wear an Oura ring and an Apple Watch with a sleep app. Both of these devices agree on when I’m underslept, and they are both correct; when my watch says I’m underslept, I feel stupid and tired and my chess.com scores plummet. My chronic pain condition is also much worse when I’m underslept. Additionally, I do not use an alarm clock, so my body will claw back the sleep it needs. If I only get 6 hours two nights in a row, I will sleep 9 hours the following night, but I habitually wake up after 7-8 hours. I can observe these patterns in my recorded sleep data, and they are robust over long stretches of time.
I say all of the above because frankly my own personal experience and data tracking is sufficient evidence for me to basically disregard any sort of thesis claiming that I need less sleep. Maybe you need less sleep, I don’t know. Do the experiment, try to sleep less for a couple of days, see if you physically implode. I would put money on the “you will probably learn that you were already pretty in tune with your body’s needs” outcome.
I fixed them, fyi.
This is fantastic. We used Critch’s calibration game and the Metaculus calibration trainer for our Practical Decision-Theory course, but it’s always good to have a very wide variety of exercises and questions.
The idea that the AI should defer to the “most recent” human values is an instance of the sort of trap I’m worried about. I suspect we could be led down an incremental path of small value changes in practically any direction, which could terminate in our willing and eager self-extinction or permanent wireheading. But how much tyranny should present-humanity be allowed to have over the choices of future humanity?
I don’t think “none” is as wise an answer as it might sound at first. To answer “none” implies a kind of moral relativism that none of us actually hold, and which would make us merely the authors of a process that ultimately destroys everything we currently value.
But also, the answer of “complete control of the future by the present” seems obviously wrong, because we will learn about entirely new things worth caring about that we can’t predict now, and sometimes it is natural to change what we like.
More fundamentally, I think the assumption that there exist “human terminal goals” presumes too much. Specifically, it’s an assumption that presumes that our desires, in anticipation and in retrospect, are destined to fundamentally and predictably cohere. I would bet money that this isn’t the case.
Yes, there is a broad class of wireheading solutions that we would want to avoid, and it is not clear how to specify a rule that distinguishes them from outcomes that we would want. When I was a small child I was certain that I would never want to move away from home. Then I grew up, changed my mind, and moved away from home. It is important that I was able to do something which a past version of myself would be horrified by. But this does not imply that there should be a general rule allowing all such changes. Understanding which changes to your utility function are good or bad is, as far as decision theory is concerned, undefined.
I am also scared of futures where “alignment is solved” under the current prevailing usage of “human values.”
Humans want things that we won’t end up liking, and prefer things that we will regret getting relative to other options that we previously dispreferred. We are remarkably ignorant of what we will, in retrospect, end up having liked, even over short timescales. Over longer timescales, we learn to like new things that we couldn’t have predicted a priori, meaning that even our earnest and thoughtfully-considered best guess of our preferences in advance will predictably be a mismatch for what we would have preferred in retrospect.
And this is not some kind of bug, this is centrally important to what it is to be a person; “growing up” requires a constant process of learning that you don’t actually like certain things you used to like and now suddenly like new things. This truth ranges over all arenas of existence, from learning to like black coffee to realizing you want to have children.
I am personally partial to the idea of something like Coherent Extrapolated Volition. But it seems suspicious that I’ve never seen anybody on LW sketch out how a decision theory ought to behave in situations where the agent’s utility function will have predictably changed by the time the outcome arrives, so the “best choice” is actually a currently dispreferred choice. (In other words, situations where the “best choice” in retrospect, and in expectation, do not match.) It seems dangerous to throw ourselves into a future where “best-in-retrospect” wins every time, because I can imagine many alterations to my utility function that I definitely wouldn’t want to accept in advance, but which would make me “happier” in the end. And it also seems awful to accept a process by which “best-in-expectation” wins every time, because I think a likely result is that we are frozen into whatever our current utility function looks like forever. And I do not see any principled and philosophically obvious method by which we ought to arbitrate between in-advance and in-retrospect preferences.
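A toy numeric example of the mismatch (all numbers hypothetical, chosen only to make the divergence visible): when a choice changes the utility function that will later score the outcome, ranking choices by the current utility function and ranking them by the utility function each choice produces can disagree.

```python
# Hypothetical scenario: choosing between "stay" (near home) and "move" (away).
# Moving predictably changes the utility function doing the evaluating.

u_now = {"stay": 10, "move": 3}           # current (in-advance) utility function
u_after_moving = {"stay": 4, "move": 15}  # utility function you'd have after moving

# Best-in-expectation: score every choice with the current utility function.
best_in_expectation = max(u_now, key=u_now.get)

# Best-in-retrospect: score each choice with the utility function
# that choosing it would produce.
retrospective = {"stay": u_now["stay"], "move": u_after_moving["move"]}
best_in_retrospect = max(retrospective, key=retrospective.get)

print(best_in_expectation, best_in_retrospect)  # prints: stay move
```

The two criteria pick different actions, and nothing inside standard decision theory says which scoring rule is the right one to maximize. That arbitration problem is exactly the gap I mean.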
Another way of saying the above is that it seems that “wanting” and “liking” ought to cohere, but how they ought to cohere is tricky to define without baking in some question-begging assumptions.
As I see it there are mainly two hard questions in alignment.
One is: how do you map human preferences in such a way that you can ask a machine to satisfy them? I don’t see any reason why this would be impossible for a superintelligent being to figure out. It is somewhat similar (though obviously not identical) to asking a human to figure out how to make fish happy.
The second is: how do you get a sufficiently intelligent machine to do anything whatsoever without doing a lot of terrible stuff you didn’t want as a side effect? As Yudkowsky says:
The way I sometimes put it is that I think that almost all of the difficulty of the alignment problem is contained in aligning an AI on the task, “Make two strawberries identical down to the cellular (but not molecular) level.” Where I give this particular task because it is difficult enough to force the AI to invent new technology. It has to invent its own biotechnology, “Make two identical strawberries down to the cellular level.” It has to be quite sophisticated biotechnology, but at the same time, very clearly something that’s physically possible.
This does not sound like a deep moral question. It does not sound like a trolley problem. It does not sound like it gets into deep issues of human flourishing. But I think that most of the difficulty is already contained in, “Put two identical strawberries on a plate without destroying the whole damned universe.” There’s already this whole list of ways that it is more convenient to build the technology for the strawberries if you build your own superintelligences in the environment, and you prevent yourself from being shut down, or you build giant fortresses around the strawberries, to drive the probability to as close to 1 as possible that the strawberries got on the plate.
When I consider whether this implied desideratum is even possible, I just note that I and many others continue to not inject heroin. In fact, I almost never seem to act in ways that look much like driving the probability of any particular outcome as close to 1 as possible. So clearly it’s possible to embed some kind of motivational wiring into an intelligent being, such that the intelligent being achieves all sorts of interesting things without doing too many terrible things as a side effect. If I had to guess, I would say that the way we go about this is something like: wanting a bunch of different, largely incommensurable things at the same time, some of which are very abstract, some of which are mutually contradictory, and somehow all these different preferences keep the whole system mostly in balance most of the time. In other words, it’s inelegant and messy and not obvious how you would translate it into code, but it is there, and it seems to basically work. Or, at least, I think it works as well as we can expect, and serves as a limiting case.
After seeing a number of rather gloomy posts on the site in the last few days, I feel a need to point out that problems that we don’t currently know how to solve always look impossible. A smart guy once pointed out how silly it was that Lord Kelvin claimed “The influence of animal or vegetable life on matter is infinitely beyond the range of any scientific inquiry hitherto entered on.” Kelvin just didn’t know how to do it. That’s fine. Deciding it’s a Hard Problem just sort of throws up mental blocks to finding potentially obvious solutions.
Maybe alignment will seem really easy in retrospect. Maybe it’s the sort of thing that requires only two small insights that we don’t currently have. Maybe we already have all the insights we need and somebody just needs to connect them together in a non-obvious way. Maybe somebody has already had the key idea, and just thought to themselves, no, it can’t be that simple! (I actually sort of viscerally suspect that the lynchpin of alignment will turn out to be something really dumb and easy that we’ve simply overlooked, and not something like Special Relativity.) Everything seems hard in advance, and we’ve spent far more effort as a civilization studying asphalt than we have alignment. We’ve tried almost nothing so far.
In the same way that we have an existence-proof of AGI (humans existing) we also have a highly suggestive example of something that looks a lot like alignment (humans existing and often choosing not to do heroin), except probably not robust to infinite capability increase, blah blah.
The “probabilistic mainline path” always looks really grim when success depends on innovations and inventions you don’t currently know how to do. Nobody knows what probability to put on obtaining such innovations in advance. If you asked me ten years ago I would have put the odds of SpaceX Starship existing at like 2%, probably even after thinking really hard about it.
One thing we are working on in the Guild of the ROSE is a sort of accreditation or ranking system, which we informally call the “belt system” because it has many but not all of the right connotations. It is possible to have expertise in how to think better, and it’s desirable to have a way of recognizing people who demonstrate that expertise, for a variety of reasons. Currently the ranking system is planned to be partly based on performance within the courses we are providing, and partly based on objective tests of skill (“belt tests”). But we are still experimenting with various ideas and haven’t rolled it out.