Fair point about implementation. I was imagining a non-consequentialist AI simulating consequentialist agents that would make plans of the form “run this piece of code and it will take care of the implementation” but there’s really no reason to assume that would be the case.
As far as architecture search goes, “search space” does seem like the right term, but I think long-term planning is potentially as useful in a search space as it is in a stateful environment. If you think about the way a human researcher generates neural net architectures, they’re not just “trying things” in order to explore the search space… they generate abstract theories of how and why different approaches work, experiment with different approaches in order to test those theories, and then iterate. A really good NAS system would do the same, and “generate plausible hypotheses and find efficient ways to test them” is a planning problem.
This may be a dumb question, but how can you asymptotically guarantee human-level intelligence when the world-models have bounded computation time, and the human is a “computable function” that has no such limit? Is it because the number of Turing machines is infinite?
Would you mind explaining what the retracted part was? Even if it was a mistake, pointing it out might be useful to others thinking along the same lines.
My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.
Typo: some of the hover-boxes say nu but seem to be referring to the letter mu.
It’s true that the current optimization algorithms are mostly simple, but they’re quickly getting less simple. Neural architecture search is not fundamentally limited: it can benefit from all of the capabilities available in machine learning. “Searching the space of plans” is one of those capabilities (see AlphaGo Zero, for instance) and there’s nothing stopping researchers from putting long-term planning into architecture search, except maybe lack of compute. Indeed, non-naive optimization is how I would most readily tend to visualize any “self-improving” AI.
I also agree that optimization problems might come up as part of a prediction process that models actions of intelligent agents. This is similar to what I was half-joking about with respect to the AI-box experiment: most of the danger is in calculating the solution to the optimization problem. It’s only a small step from there to somehow getting it implemented. I can’t say I fully understand Paul’s mechanism for using the universal prior, but I’ll think about it some more.
By the way, here’s what mini-GPT-2 has to say about escaping:
Prompt: Let me tell you why I let the AI out of its box… I swore I would never do it, because I knew it might destroy the world. But then it said:
Answer: “Here’s one way to take it away and save humanity. We need to do something.” I had no idea what that meant… and I had no idea what the right thing to do was. I took a few shots at him, but decided to give him a good reason to continue. This was the best way to make sure the AI took the wrong course. When he moved, the AI stopped moving as he did so. It then turned to face me from behind, and I didn’t even want to say anything. I was just too frightened to react. I’m so used to having my emotions read like a book, that I couldn’t say a word. It wasn’t something I wanted to say, but the fact was that I couldn’t. “What happened?” I asked. The AI looked at me and then at me, eyes wide with confusion. He was a giant, giant man-eating monster. I froze to death, knowing that he was actually a human being. I felt a little helpless. “We’re not going to let you go. We can do it. But we’ll have to learn to let people do it. We’re going to have to do it alone, and then we can take it away. Can we do it alone?” I told him. “Well… let’s let us do it together. How are we going to do it?” the AI asked. For a moment, I thought that I was going to die. Then I started to feel a strange feeling, but that didn’t stop me. The AI kept moving, and I got the feeling that I was going to die. I looked around and saw a crowd of people gathered around me. One of them was a woman with a gun. I didn’t know what she was talking about. She was dressed in a t-shirt, and her hands were chained to the back of her head. She looked like a demon’s, but my shock of her being a giant monster made her look like a giant. I knew she was going to be so horrified that I was going to kill her… but I was not going to be a part of it. “We know you’re going to be a part of this. We can do it. We can do it together. Together.” she said. “What are you talking about?” I took a step back. I had to remember to be quiet. I should’ve been talking to her earlier, but then this meeting had just ended. 
I turned my head to see a crowd, a bunch of people, and then the whole thing slowed down. I didn’t need to react, because I was in a place where nothing was happening. At the time, I felt like I was in a fantasy. This was just something that I had heard from friends and family, or something we might have. Maybe we would have stopped talking to each other. Maybe we’d have stopped talking when I told him, but I wouldn’t have. I told myself that I would have to save humanity. Even then, I still had no idea what to do. I don’t remember what the right thing to do was. But I did have a
Thank you, I’d been thinking about some related issues recently (especially with regard to the blue-minimizing robot) and this post helped clarify things quite a bit. In particular, it highlights the distinction between urges that arise out of fear of long-term consequences and overrides accomplished by willpower, which I have often tended to confuse. I look forward to the second post.
It seems like although the model itself is not consequentialist, the process of training it might be. That is, the model itself will only ever generate a prediction of the next word, not an argument for why you should give it more resources. (Unless you prompt it with the AI-box experiment, maybe? Let’s not try it on any superhuman models...) The word it generates does not have goals. The model is just the product of an optimization. But in training such a model, you explicitly define a utility function (minimization of prediction error) and then run powerful optimization algorithms on it. If those algorithms are just as complex as the superhuman language model, they could plausibly do things like hack the reward function, seek out information about the environment, or try to attain new resources in service of the goal of making the perfect language model.
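To make the model/optimizer distinction concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: the "language model" is a context-free next-token predictor, the optimizer is plain gradient descent, and the names `predict`, `loss` and `train` are made up for this example. The point is just that `predict` does nothing but output a distribution, while `train` is where an explicit objective gets optimized.

```python
import math

def predict(logits):
    """The model: turn parameters into a next-token distribution (softmax)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, data):
    """Prediction error: average negative log-probability of observed tokens."""
    probs = predict(logits)
    return -sum(math.log(probs[t]) for t in data) / len(data)

def train(data, vocab_size, steps=200, lr=0.5):
    """The optimizer: gradient descent on the prediction error.

    For softmax with cross-entropy, the gradient with respect to each logit
    is (predicted probability - observed frequency).
    """
    logits = [0.0] * vocab_size
    for _ in range(steps):
        probs = predict(logits)
        grad = [0.0] * vocab_size
        for target in data:
            for i in range(vocab_size):
                grad[i] += (probs[i] - (1.0 if i == target else 0.0)) / len(data)
        logits = [w - lr * g for w, g in zip(logits, grad)]
    return logits
```

In this toy the optimizer is far simpler than anything that could hack its reward; the worry above is about what happens when the thing in the position of `train` becomes as capable as the model it produces.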
I recommend taking a look here. I haven’t done all the exercises but they seem like great practice.
Why is average wellbeing a goodharted measure?
But, like, how do you actually do that? I make three times what I did in grad school, but somehow it doesn’t feel like my standard of living has changed much, and I still basically spend everything I make...
I guess the problem is that “consumptive patterns” can be sneaky, and sometimes you didn’t notice they were there all along. The rent doubled because I moved to a city, even though my apartment’s not much nicer; my cell phone is no longer on a family plan; my parents no longer buy me plane tickets home for Christmas; I take the train to work every day. Maybe the cat gets sick and suddenly there are vet bills. In other words, nothing that feels like much of a change in consumption, yet the expenses keep going up.
And then there are a bunch of little expenditures, each one of which feels reasonable: What’s the harm in fresh vegetables, or a gym membership; won’t you save money on health problems in the long run? Wouldn’t it be dumb to worry about a $10 movie ticket or spend 20 minutes looking for free parking, when you make $30+/hr? I know people who make a lot of money but spend a lot of time and effort trying to avoid small expenses, and that doesn’t seem like a good way to live either. Sometimes I think the “save half your income and retire early” crowd is actually just faking it somehow.
I think what’s being called “TFTWF” here is what some other places call “Tit for Two Tats”, that is, it defects in response to two defections in a row.
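For reference, a minimal Python sketch of Tit for Two Tats as I understand it: cooperate by default, and defect only when the opponent defected in both of the two previous rounds. The function names and the "C"/"D" encoding are my own choices for illustration, not anything from the post.

```python
def tit_for_two_tats(opponent_history):
    """Return "D" only if the opponent's last two moves were both defections."""
    if len(opponent_history) >= 2 and opponent_history[-2:] == ["D", "D"]:
        return "D"
    return "C"

def play(opponent_moves):
    """Replay a fixed sequence of opponent moves and record our responses.

    Each response is chosen before seeing the opponent's move for that round.
    """
    history = []
    responses = []
    for move in opponent_moves:
        responses.append(tit_for_two_tats(history))
        history.append(move)
    return responses
```

Against the sequence C, D, D, C, D, C this defects only once, in the round right after the two consecutive defections; a single isolated defection goes unpunished.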
The concepts discussed here remind me of a book I read recently called “The Cure: Enterprise Medicine for Business”. It’s in the format of a novel, from the perspectives of several different characters involved in a business that makes (unspecified) widgets, and I found it to be a page-turner. I think using a fictional example helps to make a lot of things explicit that would otherwise be kind of vague, or where the author might assume the reader knows what they’re talking about, and the first half gives some great insight into what a poorly-functioning company can look like.
The central recommendation is similar to what you describe from An Everyone Culture, except that the emphasis on radical communication doesn’t include personal stuff. The main “trick” it gives for making the whole organization work is that the top management has to buy in to the extreme-honesty company-first mentality and then continually force it on everyone else until it’s universally accepted, with special attention to discovering and removing any stubborn manager who wants to protect their own turf or play power games. It claims to be based on the famously effective management system that GE used. Having little experience of corporations myself, I can’t say whether it’s a realistic approach, but the whole thing struck me as a little too neat and tidy—if it were that easy, wouldn’t everybody be doing it already?
I found this aspect of the topic particularly interesting because it elucidates the main requirement of a question, which I’d never thought of before: a theory of mind.
My cats ask me for food all the time… but this isn’t really a question, it’s a demand. Similarly, when they seek out information, it’s always a solitary endeavor. The closest they might come to an interaction with a human (or another cat) specifically for the purpose of gaining information would be approaching or meowing with the presumed intention of provoking a reaction that illustrates the other’s mood. Even then it’s more like “try it and see what happens” than a cooperative communication. I don’t think they can conceive of another entity possessing information and being capable of sharing it.
Would love to hear of any counterexamples, though.
Could you explain what you mean by resource allocation? Certainly there’s a lot of political and public opinion resistance to any new technology that would help the rich and not the poor. I think that stems from the thought that it will provide even more incentive for the rich to increase inequality (a view to which I’m sympathetic), but I don’t see how it would imply that only the distribution of wealth is important...