I’ve been doing computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I’ve worked on computational theories of vision, executive function, episodic memory, and decision-making. I’ve focused on the emergent interactions that are needed to explain complex thought. I was increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I’m incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.
Seth Herd
It seems to me that public outcry could lead to a small but not large delay in developing AGI, and potentially could provide more time and funding for alignment work.
This is a clear and compelling argument for why AI development will not be regulated, and why any such regulations would not be honored by the military. I think this argument also applies to AGI.
However, we’re more concerned with AGI pursued by private organizations. It seems those could be regulated, and those regulations enforced. It seems to me that the limiting factor is recognition that the Chinese government is unlikely to honor such an agreement, even if it is made, and unlikely to enforce it on Chinese companies. As far as I can tell, no other government has much likelihood of being first to AGI, even with a delay.
Therefore, I’d guess that Western citizens and governments might put in place regulations that would slow down our development of AGI by a little, but not ones that would slow it down by a lot.
It sounds like you’re thinking mostly about voluntary standards. I think legislated standards are a real possibility (as the public gets more freaked out by both powerful non-agentic systems like ChatGPT, and less powerful but clearly self-directed systems). I think legislated standards are subject to this tradeoff a bit less. Legislators have much less reason to care how difficult standards are to adhere to. Therefore, sounding good to the public is going to be a bigger criterion, and that has only an indirect relationship to both ease of implementation and actual usefulness.
Do you think people eat meat despite knowing the animals are essentially tortured? Or that their beliefs are just less extreme?
Your points seem valid. However, it does seem to me overwhelmingly likely that there’s more suffering involved in eating factory-farmed meat than in eating non-meat products supplied through the global supply chain. In one case, animals suffer a lot and humans suffer; in the other, only humans suffer. I doubt that those humans would suffer less if those jobs disappeared, but that premise isn’t even necessary to make avoiding factory farming a clear win for me.
Excellent point. I totally agree. I will cease using the word torture in this context in the future, because I think it gives people something to focus on other than the thrust of the argument.
I appear boring in public so that I don’t offend anyone by appearing to claim high status or to belong to an enemy group, and so that when I do interact, I can emphasize the perspective I share with whomever I’m talking to.
If you really have insight that could save all of humanity, it seems like you’d want to share it in time to be of use instead of trying to personally benefit from it. You’d get intellectual credit, and if we get this right we can quit competing like a bunch of monkeys and all live well. I’ve forgone sharing my best ideas and credit for them since they’re on capabilities. So: pretty please?
This seems like a better model of the terrain: we don’t know how far down which path we need to get to find a working alignment solution. So the strategy “let’s split up to search, gang; we’ll cover more ground” actually makes sense before trying to stack efforts in the same direction.
I think this is worth thinking about. One important caveat is that humans have a bunch of built-in tells; we radiate our emotions, probably in part so that we can be identified as trustworthy. Another important caveat is that sociopaths do pretty well in groups of humans, so deception isn’t all that hard. One thing tribal societies had was word of mouth; gossip is extremely important for identifying those who are trustworthy.
I’m pretty sure what he means by short timelines giving less compute overhang is this: if we were to somehow delay working on AGI for, say, ten years, we’d have such an improvement in compute that it could probably run on a small cluster or even a laptop. The implied claim here is that current generations of machines aren’t adequate to run a superintelligent set of networks, or at least it would take massive and noticeable amounts of compute.
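To put rough numbers on that (the doubling time here is my own illustrative assumption, not his): if hardware price-performance doubles about every two years, a ten-year delay means roughly 2^5 = 32x more compute per dollar, so a training run that needs a large, conspicuous cluster today might fit on something far smaller and less noticeable.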
I don’t think he’s addressing algorithmic improvements to compute efficiency at all. But it seems to me that they’d push in the same direction: delaying work on AGI would also produce more algorithmic improvements, making it even easier for small projects to create a dangerous superintelligence.
I’m not sure I agree with his conclusion that short timelines are best, but I’m not sure it’s wrong, either. It’s complex because it depends on our ability to govern the rate of progress, and I don’t think anyone has a very good guess at this yet.
He’s saying all the right things. Call me a hopeless optimist, but I tend to believe he’s sincere in his concern for the existential risks of misalignment.
I’m not sure I agree with him on the short timelines to prevent overhang logic, and he’s clearly biased there, but I’m also not sure he’s wrong. It depends on how much we could govern progress, and that is a very complex issue.
These seem like useful insights, and I like the terminology. Thanks for writing this up so clearly and succinctly!
I don’t think this goes through. If I have no preference between two things, but I do prefer to not be money-pumped, it doesn’t seem like I’m going to trade those things so as to be money-pumped.
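Here’s a minimal sketch of what I mean (a toy agent with made-up names and numbers, just to make the claim concrete): indifference plus an unwillingness to pay for sidegrades already blocks the pump.

```python
# Hypothetical agent: indifferent between goods A and B, but it won't pay a
# fee for a swap that doesn't strictly improve its position.

def accepts_trade(held, offered, fee, values):
    """Trade only if the offered good is strictly better after the fee."""
    return values[offered] - fee > values[held]

values = {"A": 1.0, "B": 1.0}  # exactly indifferent between A and B

holding = "A"
for offered in ["B", "A", "B"]:  # a would-be pump: keep swapping for a fee
    if accepts_trade(holding, offered, fee=0.01, values=values):
        holding = offered

print(holding)  # still "A": every fee-bearing swap between equals is refused
```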
I’m commenting because I think this might be a crucial crux: do sufficiently smart/rational agents always act like maximizers? If not, adequate alignment might be much more feasible than if we need to find exactly the right goal and get it into our AGI exactly right.
Human preferences are actually a lot more complex. We value food very highly when we’re hungry and water when we’re thirsty. That behavior can come out of power-seeking, but that’s not actually how it’s implemented. Perhaps more importantly, we might value stamp collecting really highly until we get bored with stamp collecting. I don’t think these preferences can be modeled as maximization of any fixed quantity.
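As a toy illustration of why (all names and numbers invented): the value of an outcome depends on the agent’s internal state, so no fixed ranking over outcomes alone reproduces its choices.

```python
# The same outcome gets very different values depending on internal state,
# so the ranking of outcomes flips with hunger, thirst, or boredom.

def value(outcome, state):
    if outcome == "food":
        return 10.0 if state["hungry"] else 0.5
    if outcome == "water":
        return 10.0 if state["thirsty"] else 0.5
    if outcome == "stamps":
        return 0.1 if state["bored_of_stamps"] else 5.0
    return 0.0

keen = {"hungry": True, "thirsty": False, "bored_of_stamps": False}
sated = {"hungry": False, "thirsty": False, "bored_of_stamps": True}

print(value("food", keen), value("stamps", keen))    # 10.0 5.0
print(value("food", sated), value("stamps", sated))  # 0.5 0.1
```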
If humans would pursue multiple goals even if we could edit them (and were smart enough to be consistent), then a similar AGI might only need to be minimally aligned for success. That is, it might stably value human flourishing as a small part of its complex utility function.
I’m not sure whether that’s the case, but I think it’s important.
Thanks for amplifying the post that caused your large update. It’s pretty fascinating. I haven’t thought through it enough yet to know whether I find it as compelling as you do.
Let me try to reproduce the argument of that post to see if I’ve got it:
If an agent already understands the world well (e.g., by extensive predictive training) before you start aligning it (e.g., with RLHF), then the alignment should be easier. The rewards you’re giving are probably attaching to the representations of the world that you want them to, because you and the model share a very similar model of the world.
In addition, you really need long-term goals for deceptive alignment to happen. Those aren’t present in current models, and there’s no obvious way for models to develop them if their training signals are local in time.
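To make the “local in time” point concrete, here’s a toy contrast (the objectives and numbers are mine, not from the post): with a temporally local signal, an action is scored only on its immediate reward, so forgoing reward now for a big payoff later never looks attractive.

```python
# Two hypothetical ways of scoring an action, given its immediate reward and
# the rewards that follow from it.

def action_value_myopic(immediate_reward, future_rewards):
    # gamma = 0: future consequences of the action don't count
    return immediate_reward

def action_value_long_horizon(immediate_reward, future_rewards, gamma=0.99):
    # future rewards count, so sacrificing reward now for a later payoff
    # (the shape of deceptive alignment) can be the better policy
    return immediate_reward + sum(
        r * gamma ** (t + 1) for t, r in enumerate(future_rewards)
    )

comply = (1.0, [1.0, 1.0, 1.0])      # decent reward every step
deceive = (0.9, [0.9, 0.9, 100.0])   # forgo a little now, grab a lot later

print(action_value_myopic(*comply) > action_value_myopic(*deceive))              # True
print(action_value_long_horizon(*comply) > action_value_long_horizon(*deceive))  # False
```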
I agree with both of these points, and I think they’re really important—if we make systems with both of those properties, I think they’ll be safe.
I’m not sure that AGI will be trained such that it knows a lot before alignment starts. Nor am I sure that it won’t have long-term goals; I think it will, to say nothing of ASI.
But I think that tool AI might well continue to be trained that way. And that will give us a little longer to work on alignment. But we aren’t likely to stop with tool AI, even if it is enough to transform the world for the better.
Clippy isn’t a maximizer. And neither is any current RL agent. I did mention that, but I’ll edit to make that clear.
The issue you describe is one issue, but not the only one. We do know how to train an agent to do SOME things we like. The concern is that it won’t be an exact match. The question I’m raising is: can we be a little or a lot off-target and still have that be enough, because we’ve captured some overlap between our values and the agent’s?
What I was trying to say is that an RL agent DOES maximize the output of its critic network, but the critic network does not reflect states of the world directly. Therefore the total system isn’t directly a maximizer. The question I’m trying to pose is whether it acts like a maximizer under particular conditions of training and RL architecture.
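Here’s a bare-bones sketch of the distinction (toy scores, not any particular RL library): the action-selection step maximizes the critic’s output, but the critic is a learned proxy, so the overall system maximizes its own evaluation of the world rather than any state of the world directly.

```python
def critic(state, action):
    # stand-in for a trained value network: learned estimates, not the world
    learned_scores = {("s0", "left"): 0.2, ("s0", "right"): 0.7, ("s0", "wait"): 0.1}
    return learned_scores.get((state, action), 0.0)

def act(state, actions):
    # the maximizer step: pick whatever the critic currently scores highest
    return max(actions, key=lambda a: critic(state, a))

print(act("s0", ["left", "right", "wait"]))  # "right", per the critic's scores
```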
While you’re technically correct that an NN is a mathematical function, it seems fair to say that it’s not an explicit function in the sense that we can’t read or interpret it very well.
Thank you! This is addressing the question I was trying to get at. I’ll check it out.
Thank you! It’s particularly embarrassing to write a stereotypical newbie post since I’ve been thinking about this, and reading LW and related sites, since maybe 2004, and had been a true believer in the difficulty of aligning maximizers until re-engaging recently. Your way of phrasing it clicks for me, and I think you’re absolutely correct about where most of the work is being done in this fable. This post didn’t get at the question I wanted, because it implies that aligning an RL model will be easy if we try, and I don’t believe that. I agree with you that shard theory requires magic. There are some interesting recent arguments (here and here) that aligning an RL system might be easy if it has a good world model when we start aligning it, but I doubt that’s a workable approach, for practical reasons.
It was my intent to portray a situation where much less than half of the training went to alignment, and that little bit might still be stable and useful. But I’d need to paint a less rosy picture of the effort and outcome to properly convey that.
This seems like a valid concern. It seems to apply to other directions in alignment research as well. Any approach can make progress in some direction seem easier even when that direction is ultimately a dead end.
Based on that logic, it would seem that having more different approaches should serve as a sort of counterbalance. As we make judgment calls about ease of progress vs. ultimate usefulness, having more options seems likely to produce better progress in useful directions.