Thanks, I’ll edit the post to note I misinterpreted the paper.
Noosphere89
Correct on that.
But it might nevertheless automate most jobs within a decade or so, and then continue churning along, automating new jobs as they come up.
I think this is less likely than I did a year ago, and a lot of this is informed by Steve Newman’s blog post on a project not being a bundle of tasks.
My median expectation is that by 2030 we get AIs with a 50%-reliability time horizon of 1-3 months and an 80%-reliability time horizon of about 1 week. Under this view, that's not enough to automate away managers, and depending on how much benchmarks diverge from reality, it may not even be enough to automate away most regular workers. My biggest probable divergence from you is that I don't expect super-exponential progress to come soon enough to bend these curves up, because trend breaks lead me to put much less weight than you do on superexponential progress within 5 years.
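As a rough illustration of where horizon numbers like these come from, here's a minimal extrapolation sketch. The starting horizon and doubling time below are illustrative assumptions of mine, not figures from this comment; the point is only that a plain exponential trend lands in roughly this range without any superexponential bend:

```python
# Back-of-the-envelope extrapolation of task time horizons (all numbers are
# illustrative assumptions, not claims from the comment above).

current_horizon_hours = 2.0    # assumed 50%-reliability horizon today
doubling_time_months = 6.0     # assumed doubling time; a pure exponential, no bend
months_until_2030 = 60         # roughly five years out

doublings = months_until_2030 / doubling_time_months
horizon_hours = current_horizon_hours * 2 ** doublings
print(f"Extrapolated 50% horizon in 2030: ~{horizon_hours / (24 * 30):.1f} months of work")
# ~2.8 months under these assumptions; a superexponential trend would bend the curve
# up well past that, which is the disagreement described above.
```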
Here's the link to "A project is not a bundle of tasks."
I have nothing to say on the rest of your comment.
Most Algorithmic Progress is Data Progress [Linkpost]
To be completely honest, this should not be voted for by basically anyone in the review; it was just a short reaction post that doesn't have enduring value.
I've increasingly come to think that being able to steelman positions, especially positions you don't hold, is an extremely important skill for effective truth-finding, especially in the modern era, and that steelmanning is a normal part of finding the truth effectively rather than an exceptional trait.
Not doing this is a lot of the reason why political discussions tend to end up so badly.
This is why I give this post a +4.
That said, there are 2 important caveats that limit the applicability of this principle.
My explanation for why LW has been less focused on core rationality content is, in broad strokes, that AI has grown in importance, and more generally that one of the lessons rationalists have learned is that object-level practice in a skill usually has much less sharply diminishing returns than meta-level thinking (which is yet another example of continual learning mattering a lot for human success).
I would analogize this to a human with anterograde amnesia, who cannot form new memories, and who is constantly writing notes to keep track of their life. The limitations here are obvious, and these are limitations future Claudes will probably share unless LLM memory/continual learning is solved in a better way.
This is an extremely underrated comparison, TBH. Indeed, I'd argue that frozen weights plus the lack of long-term memory are easily among the biggest reasons why LLMs are much more impressive than useful at a lot of tasks (with reliability being another big, independent issue). It emphasizes 2 things that are both true at once: LLMs do in fact reason like humans and can have (poor-quality) world-models, and there's no fundamental chasm between LLM capabilities and human capabilities that can't be closed with unlimited resources/time; and yet, just as humans with anterograde amnesia are usually much less employable/useful to others than people who do have long-term memory, current AIs are much, much less employable/useful than future-paradigm AIs will be.
Some good picks for the how-to-design-reward-functions starter pack (though I should note that their empirical support is very weak due to focusing on toy models) are Defining Corrigible and Useful Goals and Defining Monitorable and Useful Goals.
The first post focuses on how you can give an AI a goal that allows you to shut it down while keeping it useful, and the approach to corrigibility it takes is extremely different from how human brains work, using the corrigibility transformation to get corrigible AIs.
One big caveat here is that it requires the assumption that Causal Decision Theory is used, but I'm mostly fine with that assumption, given that humans intuitively use Causal Decision Theory and that it's part of the spec of the transformation rather than a hidden background assumption.
The other big caveat is that the model has to optimize for the reward in order for this to work. So in terms of under-sculpting vs over-sculpting, or whether an AI is driven by the reward vs driven by another goal, you want the AI to reward-maximize and be over-sculpted (though in this case it's just appropriately sculpted via the reward). That makes the approach incompatible with corrigibility/alignment hopes which depend on AIs not maximizing the reward, but I think this is a good property to have.
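I won't try to restate the post's actual corrigibility transformation here (read the post for the real construction). But for readers who want the general flavor of this family of "indifference-style" reward surgeries, here's a toy sketch of my own; the function and numbers are hypothetical simplifications, not the post's definitions:

```python
# Toy "indifference-style" reward surgery. This is my own simplified illustration of the
# general shape of such transformations, NOT the corrigibility transformation defined in
# "Defining Corrigible and Useful Goals".

def transformed_reward(raw_reward, shutdown_pressed, estimated_value_if_continuing):
    """If shutdown is pressed, pay out the agent's own estimate of the value it would have
    gotten by continuing, so in expectation it has no incentive to cause or block shutdown."""
    if shutdown_pressed:
        return estimated_value_if_continuing
    return raw_reward

# Example: an agent that expects 10 reward from continuing gets 10 either way,
# so manipulating the shutdown button has zero expected payoff.
print(transformed_reward(raw_reward=10.0, shutdown_pressed=False, estimated_value_if_continuing=10.0))
print(transformed_reward(raw_reward=0.0, shutdown_pressed=True, estimated_value_if_continuing=10.0))
```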
The post on defining monitorable and useful goals proposes the monitorability transformation as a way to make AIs not incentivized to fool monitors in general, and I'd recommend reading that over any explanation I'd give.
These are admittedly curveballs compared to standard LW thoughts on this, but this is why I picked them for the reward functions starter pack, as they contain novel ideas to deal with some notorious problems.
Consider an agent reasoning: “What kind of process could have produced me?” If the agent is literally the argmax of some simple scoring function, then the selection process must have enumerated all possible agents, evaluated f on each, and picked the maximum. This is physically unrealizable: it requires resources exceeding what’s available in the environment. So the agent concludes that it wasn’t generated by the argmax.
This is the invalid step of reasoning, because for AIXI agents, the environment is allowed to have unlimited resources/be very complicated by construction, and you can have environments which do allow you to do the literal search procedure.
This is why AIXI is usually considered in an unbounded setting, where we give it unlimited memory and time, like a Universal Turing Machine, and grant it certain oracular powers so that it's actually possible to use AIXI for inference or planning.
You underestimate how complicated and resource-rich environments are allowed to be.
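For reference (this is the standard presentation of AIXI, not anything from the post), the unbounded action-selection rule makes the point explicit: it sums over every environment program $q$ consistent with the history, weighted by $2^{-\ell(q)}$, which is only feasible with unlimited compute and halting-oracle-level powers:

$$
a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[r_k + \cdots + r_m\big] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
$$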
Another gloss: we can’t define what it means for an embedded agent to be “ideal” because embedded agents are messy physical systems, and messy physical systems are never ideal. At most they’re “good enough”. So we should only hope to define when an embedded agent is good enough. Moreover, such agents must be generated by a physically realistic selection process.
This is very dependent on what the rules of the environment are, and embedded agents can be ideal in certain environments.
I feel this is a valid critique not just of our research community, but of society in general. It is the great man theory of history, and I believe modern sociology has found the theory mostly invalid.
I want to flag here that the version of great man theory that was debunked by modern sociology is the claim that big impacts on the world are always/almost always caused by great men, not the claim that great men can't have big impacts on the world. For what it's worth, I actually disagree with this view, and think that one of the bigger things LW gets right is that people's impact in a lot of domains is pretty heavy-tailed, and certain things matter way more than others under their utility function.
I do agree that people can round the impact of rare geniuses off to infinity, and there is a point to be made about LWers overvaluing theory/curiosity-driven work compared to just using simple baselines and doing what works (I agree with this critique). But the appreciation of heavy-tailed impact is one of the things I most value about LW, and while some problems do stem from it, I also think it's important not to damage that appreciation too much in the course of solving those problems (assuming the heavy-tailed hypothesis is true, which I largely believe).
especially as compute will only keep scaling until ~2030, and then the amount of fuel for exploring algorithmic ideas won’t keep growing as rapidly
Technical flag: compute scaling will slow down to the historical Moore's law trend plus historical fab buildout rates rather than stopping completely, which means growth goes from roughly 3.5x per year to 1.55x per year. But yes, this does take some wind out of the sails of algorithmic progress (though it's helpful to note that even post-LLM scaling, we'll be able to simulate human brains passably by the late 2030s, speeding up progress to AGI).
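For a sense of scale, here's the compounding difference between the two growth rates over an illustrative five-year window:

```python
# Compounding the two growth rates mentioned above over five years (window length is illustrative).
years = 5
fast = 3.5 ** years    # roughly the current rate of frontier training-compute growth
slow = 1.55 ** years   # roughly Moore's law plus fab buildout
print(f"5 years at 3.5x/yr: ~{fast:.0f}x more compute")   # ~525x
print(f"5 years at 1.55x/yr: ~{slow:.0f}x more compute")  # ~9x
```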
Another potential implication is that we should be more careful when talking about misalignment in LLMs, as misalignment might be due to the model being gaslighted into believing that it’s capable of doing something it isn’t.
This would affect the interpretation of the examples Habryka gave below:
The main reason I was talking about RNNs/neuralese architectures is that it fundamentally breaks the assumption that nice CoT is the large source of evidence on safety that it is today. Quite a lot of near-term safety plans assume that the CoT is where the bulk of the relevant work is, meaning it’s pretty easy to monitor.
To be precise, I’m claiming this part of your post is little evidence for safety post-LLMs:
I have gone over a bunch of the evidence that contradicts the classical doom picture. It could almost be explained away by noting that capabilities are too low to take over, if not for the fact that we can see the chains of thought and they’re by and large honest.
This also weakens the influence of pre-training priors, since the AI continually learns, compared to today's AIs, which stop learning after training (which is why they have knowledge cutoff dates). That means we can't rely on the pre-training prior to automate alignment (though the human-derived pre-training prior is actually surprisingly powerful, and I expect it to enable completing 1-6 month tasks by 2030, which is when scaling most likely starts to slow down, absent TAI arising by then, which I think is reasonably likely).
Agree that interpretability might get easier, and this is a useful positive consideration to think about.
So I've become more pessimistic about this sort of outcome happening over the last year, and a lot of it comes down to me believing in longer timelines, which means I expect the LLM paradigm to be superseded, and the most likely capability-increasing options in this space unfortunately correspond to more neuralese/recurrent architectures. To be clear, they're hard enough to make work that I don't expect them to outperform transformers so long as transformers can keep scaling, and even after the immense scale-up of AI slows down to the Moore's law trend by 2030-2031, I still expect neuralese to be at least somewhat difficult, possibly difficult enough that we might survive because AI never reaches the capabilities we expect.
(Yes, the scaling hypothesis in a weak form will survive, but I don't expect the strong versions of the scaling hypothesis to work. My reasons for believing this are available on request.)
It’s still very possible this happens, but I wouldn’t put much weight on this for planning purposes.
I agree with a weaker version of the claim that says the AI safety landscape is looking better than people thought 10-15 years ago, with AI control probably being the best example here. To be clear, I do think this is actually meaningful and does matter, primarily because these approaches focus less on the limiting cases of AI competence, but I'm currently not as optimistic that some of the important alignment-relevant properties of LLMs will survive into newer AI designs, which is why I disagree with this post.
You should actually tag @Vladimir_Nesov instead of Vladimir M, as Vladimir Nesov was the original author.
Or early AGIs convince/coerce humanity into not rushing to superintelligence before it’s clear how to align it with anyone’s well-being (including that of the early AGIs).
BTW, this sort of thing (where the AI also has an interest in slowing down progress) is one of the reasons why AI safety plans that depend on a certain level of capabilities being hit might not fall apart, as AI being slowed down lets us stay in the sweet spot longer.
This does rely on the assumption that it's very hard to solve the alignment problem even for AGIs, which isn't given much likelihood in my models of the world, but this sort of thing could very well prevent human extinction even in worlds where AI alignment is very hard and we don't get much regulation of AI progress from now on.
Another reason people tend to grow organizations, especially in middle-management positions, is that coordination is a key constraint, and anything that loosens this constraint, even if it damages a lot of other things, is often worth it, because coordination is one of the few areas where diminishing returns don't kick in as early.
This is part of a more general trend of middlemen being more important than ever (and that's actually necessary to run modern societies).
So any solution to this sort of problem would implicitly be a solution to coordination problems in general.
However, there’s an important dynamic this model misses, which is that, when the detector becomes less effective, the model’s capabilities might also decline. For example, suppose a lie detector functions because it leverages some important representations that help a model reason about its situation. A single gradient step could make the lie detector less reliable, but to do so, it might need to distort some of those important internal representations. And as a result, the model would become worse at e.g., writing code. So it’s possible that reducing the effectiveness of a detector requires paying a tax.
An important implication of this result is that good misalignment detectors should have the property that either the AI gets detected, or, if it can be undetectably misaligned, it ends up with lower capabilities than the hypothetical aligned AI. And this is why, even if we never get holy-grail interpretability before we face serious misalignment risk from AI (if at all), interpretability research is still useful even when we can't explain everything going on in the model: you can use it to build detectors that make AIs pay a tax for undetectable misalignment.
Similar stories hold for AI control, and it’s a big reason why I like the fact that AI control is getting funded right now.
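To make the "tax" intuition concrete, here's a toy sketch (my own illustration with made-up numbers, not anything from the work being discussed): when the task output and the detector probe read the same internal feature direction, attenuating that feature to dodge the probe also hurts task performance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (entirely hypothetical): one shared feature direction drives both the
# model's task output and a "lie detector" probe.
x = rng.normal(size=(1000, 8))
w_shared = rng.normal(size=8)
y_task = x @ w_shared                          # task target depends on the shared direction
probe = w_shared / np.linalg.norm(w_shared)    # the detector probe reads the same direction

def task_loss(scale):
    """Mean squared task error if gradient pressure attenuates the shared feature by `scale`."""
    return float(np.mean((scale * (x @ w_shared) - y_task) ** 2))

def detector_signal(scale):
    """Average magnitude of what the probe still sees after attenuation."""
    return float(scale * np.mean(np.abs(x @ probe)))

for scale in (1.0, 0.5, 0.1):
    print(f"scale={scale}: task_loss={task_loss(scale):.2f}, detector_signal={detector_signal(scale):.2f}")
# As the detector signal drops, task loss rises: fooling the probe pays a capability tax.
```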
The other part is that partisans of a narrative over-focus on the other side's bad arguments, because most people can't check the arguments themselves. To be frank, the entire area is a mess I'm not willing to wade into, so I instead focus on less-charged topics.
Like, all 4 of the examples are great demonstrations of why you need to be able to steelman your opponent, and one of the central problems in politics is that people are trapped in a loop of destroying bad arguments instead of focusing on good arguments.
This is my biggest disagreement at the moment. The reason is that, unlike 2008 or 2020, there's no supply squeeze or financial consequences severe enough that banks start to fail, and I expect an AI bubble to look more like the 2000 bubble than the 2008 or 2020 bubbles/crises.
That said, AI stocks would fall hard and GPUs would become way, way cheaper.