An underrated answer is that humans are very, very dependent on other people to survive. We have easily the longest vulnerable childhood of any mammal, and even as adults we are still really bad at surviving on our own compared to other animals. And since we are K-selected, every dead child matters a lot in evolutionary terms, so it’s very, very difficult for sociopathy to be selected for.
(Crossposted from EAF)
Nice write-up on the issue.
One thing I will say is that I’m maybe unusually optimistic about power concentration compared to a lot of EAs/LWers, and the main divergence I have is that I basically treat this counter-argument as decisive enough to make me think the power-concentration risk doesn’t go through, even in scenarios where humanity is about as careless as possible.
This is due to evidence on human utility functions suggesting that, for exclusive goods used personally, most people’s returns diminish fast enough that altruism matters much more than selfish desires at stellar/galactic scales, combined with me being a relatively big believer in quite a few risks (such as suffering risks) being very cheap to solve via moral trade in areas where most humans are apathetic.
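As a toy illustration of how fast such returns diminish (my own numbers and a logarithmic utility assumption, used purely for illustration, not a claim about the evidence I linked):

```python
import math

# Toy sketch: assume (purely for illustration) logarithmic utility over
# personally consumed resources, measured in star-equivalents of wealth.
def selfish_utility(stars: float) -> float:
    return math.log(stars)

# Going from owning one star to owning a whole galaxy (~10^11 stars)
# multiplies resources by 10^11 but adds only ~25 "utils" under log utility,
# while the altruistic value of those resources scales roughly linearly
# with how much is given away.
gain = selfish_utility(1e11) - selfish_utility(1)
print(round(gain, 1))  # ~25.3
```

Under any utility with this kind of curvature, almost all the value of a galaxy’s worth of resources to a typical person comes from what it can do for others rather than from personal consumption.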
More generally, I’ve become mostly convinced of the idea that a crucial positive consideration for any post-AGI/ASI future is that it’s really, really easy to prevent most of the worst things that can happen in those futures under a broad array of values, even if moral objectivism/moral realism is false and there isn’t much convergence on values amongst the broad population.
Edit: I edited in a link.
Whether to pursue an aggressive (stock-heavy) or conservative (bond-heavy) investment strategy. If there is an AI bubble pop, it will likely bring the entire economy into a recession.
This is my biggest disagreement at the moment, and the reason is that, unlike 2008 or 2020, there’s no supply squeeze or financial consequences severe enough that banks start to fail, and I expect an AI bubble to look more like the 2000 bubble than the 2008 or 2020 bubbles/crises.
That said, AI stocks would fall hard and GPUs would become way, way cheaper.
Thanks, I’ll edit the post to note I misinterpreted the paper.
Correct on that.
But it might nevertheless automate most jobs within a decade or so, and then continue churning along, automating new jobs as they come up.
I think this is less likely than I did a year ago, and a lot of this is informed by Steve Newman’s blog post on a project not being a bundle of tasks.
My median expectation is that by 2030 we get models with a 50%-reliability time horizon of 1-3 months and an 80%-reliability time horizon of about 1 week. Under this view, that is not enough to automate away managers and, depending on how much benchmarks diverge from reality, may not even be enough to automate away most regular workers. My biggest probable divergence is that I don’t expect super-exponential progress to come soon enough to bend these curves up, because I put much less weight than you do on superexponential progress arriving within 5 years from trend breaks.
Here’s the link for “a project is not a bundle of tasks”.
I have nothing to say on the rest of your comment.
To be completely honest, this should not be voted for by basically anyone in the review; it was just a short reaction post that doesn’t have enduring value.
I’ve come to increasingly think that being able to steelman positions, especially positions you don’t hold, is an extremely important skill for effective truth-finding, especially in the modern era, and that steelmanning is a normal part of effectively finding the truth rather than an exceptional trait.
Not doing this is a lot of the reason why political discussions tend to end up so badly.
This is why I give this post a +4.
That said, there are 2 important caveats that limit the applicability of this principle.
My prediction for why LW has been less focused on core rationality content is, in broad strokes, that AI has grown more important, and more generally that one of the lessons rationalists have learned is that object-level practice in a skill (usually) hits diminishing returns much later than meta-level thinking does (which is yet another example of continual learning mattering a lot for human success).
I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
This is an extremely underrated comparison, TBH. Indeed, I’d argue that frozen weights plus the lack of long-term memory are easily one of the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue). It emphasizes two things that are both true at once: LLMs do in fact reason like humans and can have (poor-quality) world-models, and there’s no fundamental chasm between LLM capabilities and human capabilities that couldn’t be crossed with unlimited resources/time; and yet, just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable/useful than future-paradigm AIs will be.
Some good picks for the “how to design reward functions” starter pack (though I should note that their empirical support is very weak due to focusing on toy models) are Defining Corrigible and Useful Goals and Defining Monitorable and Useful Goals.
The first post focuses on how you can give AIs a goal that allows you to shut down the AI while keeping the AI useful, and the approach to corrigibility it takes is extremely different from how human brains work, using the corrigibility transformation to get corrigible AIs.
One big caveat here is that it definitely requires the assumption that Causal Decision Theory is used, but I’m mostly fine with that assumption, given that humans intuitively use Causal Decision Theory and it’s in the spec of the transformation rather than a background assumption.
The other big caveat is that you need the model to optimize for the reward in order for this to work. In terms of under-sculpting vs over-sculpting, or whether an AI is driven by the reward vs driven by another goal, you want the AI to reward-maximize and be over-sculpted (though in this case it’s just appropriately sculpted via reward). That makes it incompatible with corrigibility/alignment hopes that depend on AIs not maximizing the reward, but I think this is a good property to have.
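To give a flavor of what such a transformation can look like, here’s a toy sketch in the spirit of the post (my own simplification, not the post’s actual construction; all names are hypothetical): when a shutdown instruction arrives, a complying agent is paid its own estimate of the reward it would have earned by continuing, so compliance is never a sacrifice under its own (CDT-style) expectations.

```python
# Toy sketch of shutdown-indifference via reward compensation.
# My own illustrative simplification, not the construction from
# "Defining Corrigible and Useful Goals"; all names are hypothetical.

def corrigible_reward(base_reward: float,
                      expected_continuation_reward: float,
                      shutdown_requested: bool,
                      complied: bool) -> float:
    """base_reward: reward the agent actually earned this episode.
    expected_continuation_reward: the agent's own estimate of the reward
    it would have earned had it ignored the shutdown request."""
    if not shutdown_requested:
        return base_reward                   # ordinary episodes are unchanged
    if complied:
        return expected_continuation_reward  # compensated: complying never looks
                                             # worse than continuing, in expectation
    return base_reward                       # resisting earns no bonus

# Under its own estimates, a reward maximizer is indifferent between the
# shutdown and no-shutdown branches, so it has no incentive to resist the
# button or to manipulate whether it gets pressed.
print(corrigible_reward(3.0, 10.0, shutdown_requested=True, complied=True))   # 10.0
print(corrigible_reward(3.0, 10.0, shutdown_requested=True, complied=False))  # 3.0
```

The actual posts do considerably more work than this toy, particularly in defining the relevant expectations carefully so that the incentives actually hold.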
The post on defining monitorable and useful goals proposes the idea of the monitorability transformation to get AIs to not be incentivized to fool monitors generally, and I’d recommend reading that over any explanation I’d give.
These are admittedly curveballs compared to standard LW thoughts on this, but this is why I picked them for the reward functions starter pack, as they contain novel ideas to deal with some notorious problems.
Consider an agent reasoning: “What kind of process could have produced me?” If the agent is literally the argmax of some simple scoring function, then the selection process must have enumerated all possible agents, evaluated f on each, and picked the maximum. This is physically unrealizable: it requires resources exceeding what’s available in the environment. So the agent concludes that it wasn’t generated by the argmax.
This is the invalid step in the reasoning: for AIXI agents, the environment is allowed by construction to have unlimited resources and be arbitrarily complicated, and there are environments in which the literal search procedure can actually be carried out.
This is why AIXI is usually considered in an unbounded setting, where we give AIXI unlimited memory and time, like a universal Turing machine, along with certain oracular powers to make it possible to actually use AIXI for inference or planning.
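For reference, Hutter’s standard definition makes the unboundedness explicit: the chosen action is an argmax over a full expectimax tree, with environments weighted by summing over every program consistent with the history on a universal Turing machine $U$:

$$a_k := \arg\max_{a_k}\sum_{o_k r_k}\,\cdots\,\max_{a_m}\sum_{o_m r_m}\,[\,r_k+\cdots+r_m\,]\sum_{q\,:\,U(q,\,a_{1:m})\,=\,o_{1:m}r_{1:m}}2^{-\ell(q)}$$

Nothing in this definition requires the search to be carried out by a physically realizable process inside the environment.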
You underestimate how complicated and resource-rich environments are allowed to be.
Another gloss: we can’t define what it means for an embedded agent to be “ideal” because embedded agents are messy physical systems, and messy physical systems are never ideal. At most they’re “good enough”. So we should only hope to define when an embedded agent is good enough. Moreover, such agents must be generated by a physically realistic selection process.
This is very dependent on what the rules of the environment are, and embedded agents can be ideal in certain environments.
I feel this is a valid critique not just of our research community, but of society in general. It is the great man theory of history, and I believe modern sociology has found the theory mostly invalid.
I want to flag here that the version of great man theory debunked by modern sociology is the claim that big impacts on the world are always/almost always caused by great men, not the claim that great men can’t have big impacts on the world.

For what it’s worth, I actually disagree with this view, and think that one of the bigger things LW gets right is that people’s impact in a lot of domains is pretty heavy-tailed, and certain things matter way more than others under their utility function.
I do agree that people can round the impact of rare geniuses off to infinity, and there is a point to be made about LWers overvaluing theory/curiosity-driven work compared to just using simple baselines and doing what works (and I agree with this critique). But the appreciation of heavy-tailed impact is one of the things I most value about LW, and while there are real problems that stem from it, I also think it’s important not to damage that appreciation too much in solving those problems (assuming the heavy-tailed hypothesis is true, which I largely believe).
especially as compute will only keep scaling until ~2030, and then the amount of fuel for exploring algorithmic ideas won’t keep growing as rapidly
A technical flag: compute scaling will slow down to the historical Moore’s law trend plus historical fab buildout rates rather than stopping completely, which means it goes from about 3.5x per year to about 1.55x per year. Yes, this does take some wind out of the sails of algorithmic progress (though it’s helpful to note that even after LLM scaling slows, we’ll be able to simulate human brains passably by the late 2030s, speeding up progress to AGI).
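To make the difference concrete (the 3.5x and 1.55x figures are the ones above; the 5-year window is just an illustrative choice):

```python
# Cumulative compute growth under the current scale-up vs. the post-2030
# Moore's-law-plus-fab-buildout regime described above.
fast_rate, slow_rate = 3.5, 1.55  # multiplicative growth per year
years = 5                          # illustrative window

print(f"fast: {fast_rate ** years:.0f}x over {years} years")   # ~525x
print(f"slow: {slow_rate ** years:.0f}x over {years} years")   # ~9x
```

So a 5-year span that currently buys roughly 500x more compute would buy roughly 9x under the slower regime, which is the sense in which the fuel for algorithmic exploration keeps growing, just much less rapidly.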
Another potential implication is that we should be more careful when talking about misalignment in LLMs, as misalignment might be due to the model being gaslighted into believing that it’s capable of doing something it isn’t.
This would affect the interpretation of the examples Habryka gave below:
The main reason I was talking about RNNs/neuralese architectures is that they fundamentally break the assumption that nice CoT is the large source of evidence on safety that it is today. Quite a lot of near-term safety plans assume that the CoT is where the bulk of the relevant work happens, meaning it’s pretty easy to monitor.
To be precise, I’m claiming this part of your post is little evidence for safety post-LLMs:
I have gone over a bunch of the evidence that contradicts the classical doom picture. It could almost be explained away by noting that capabilities are too low to take over, if not for the fact that we can see the chains of thought and they’re by and large honest.
This also weakens the influence of pre-training priors, since the AI keeps learning continually, unlike today’s AIs, which stop learning after training (which is why they have knowledge cutoff dates). That means we can’t rely on the pre-training prior to automate alignment (though the human-data pre-training prior is actually surprisingly powerful, and I expect it to be able to complete 1-6 month tasks by 2030, which is most likely when scaling starts to slow down, absent TAI arising by then, which I think is reasonably likely).
Agree that interpretability might get easier, and this is a useful positive consideration to think about.
So I’ve become more pessimistic about this sort of outcome happening over the last year, and a lot of it comes down to my belief in longer timelines, which means I expect the LLM paradigm to be superseded, and the most likely capability-increasing options in this space unfortunately correspond to more neuralese/recurrent architectures. To be clear, those are hard enough to make work that I don’t expect them to outperform transformers so long as transformers can keep scaling, and even after the immense scale-up of AI slows down to the Moore’s law trend by 2030-2031, I still expect neuralese to be at least somewhat difficult, possibly difficult enough that we might survive because AI never reaches the capabilities we expect.
(Yes, the scaling hypothesis in a weak form will survive, but I don’t expect the strong versions of the scaling hypothesis to work. Reasons for why I believe this are available on request).
It’s still very possible this happens, but I wouldn’t put much weight on this for planning purposes.
I agree with a weaker version of the claim, namely that the AI safety landscape is looking better than people thought 10-15 years ago, with AI control probably being the best example here. To be clear, I do think this is actually meaningful and does matter, primarily because these approaches focus less on the limiting cases of AI competence, but I’m currently not as optimistic that the important properties of LLMs relevant for alignment will survive into newer AI designs, which is why I disagree with this post.
You should actually tag @Vladimir_Nesov instead of Vladimir M, as Vladimir Nesov was the original author.
Or early AGIs convince/coerce humanity into not rushing to superintelligence before it’s clear how to align it with anyone’s well-being (including that of the early AGIs).
BTW, this sort of thing (where the AI also has an interest in slowing down progress) is one of the reasons why AI safety plans that depend on capabilities staying at a certain level might not fall apart, as slowing AI down lets us stay in that sweet spot longer.
This does rely on the assumption that it’s very hard to solve the alignment problem even for AGIs, which isn’t given much likelihood in my models of the world, but this sort of thing could very well prevent human extinction even in worlds where AI alignment is very hard and we don’t get much regulation of AI progress from now on.
Another reason why people tend to grow organizations, especially middle-management layers, is that coordination is a key constraint, and anything that loosens this constraint, even if it damages a lot of other things, is often worth it, because coordination is one of the few areas where diminishing returns don’t apply as early.
This is part of a more general trend of middlemen being more important than ever (and that’s actually necessary to run modern societies).
So any solution to this sort of problem would implicitly be a solution to coordination problems in general.
The weaker version of pure reward optimization in humans is basically just the obesity issue: the biggest reason humans became more obese at a population level in the 20th and 21st centuries is that we figured out how to Goodhart human reward models in the domain of food, such that sugary, fatty, high-calorie foods (with the high-calorie part mattering most for obesity) are very, very highly rewarding to at least part of our brains.
Essentially everything has gotten more palatable and rewarding to eat than pre-20th century food.
And as you say, part of the issue is that drugs are very crippling for capabilities, whereas the reward functions AIs optimize will have much less of this issue, and optimizing the reward function likely makes AIs more capable.