Yep, I think this kills it. I have a sort of argument in my head that nothing can emit energy more slowly than a black hole does via Hawking radiation.
Assuming protons don’t decay and that there’s no big rip, I feel like you can do obnoxiously large numbers. Build a clock out of superconductors that consumes zero power until the top bit flips over (since incrementing a counter is reversible, this should be possible). Then, when your “alarm” goes off, wake up the sentient being and let it have its last thought. The limit is now the number of bits B in your clock. Assume B is somewhere between 10**67 (roughly the number of atoms in the galaxy) and 10**80 (roughly the number of atoms in the visible universe). Your wake-up time is now 2**B ticks, i.e. somewhere between 2**(10**67) and 2**(10**80).
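For a sense of scale, here’s a rough back-of-the-envelope sketch in Python (the bit counts are the same illustrative assumptions as above):

```python
import math

# Back-of-the-envelope for the clock argument: a B-bit counter rolls over
# after 2**B ticks. 2**B is far too large to compute directly, so estimate
# its size via log10(2**B) = B * log10(2).
for B in (10**67, 10**80):  # galaxy-scale vs. observable-universe-scale bit counts
    decimal_digits = B * math.log10(2)
    print(f"B = 10**{len(str(B)) - 1}: 2**B has roughly {decimal_digits:.1e} decimal digits")
```

Even writing the wake-up time out in decimal would take a galaxy-scale number of digits, which is rather the point.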
It certainly seems like it fits in that category.
Let’s take a concrete example.
Assume you have an AI that could get 100% on every Putnam exam. Do you think it would or would not be reasonable to expect such an AI to also display superhuman performance at solving the Yang-Mills Mass Gap?
This doesn’t include working out advances in fundamental physics, or designing a fusion reactor, or making breakthroughs in AI research.
Why don’t all of these fall into the self-play category? Physics, software and fusion reactors can all be simulated.
I would be mildly surprised if a sufficiently large language model couldn’t solve all of Project Euler + Putnam + the MATH dataset.
I strongly doubt we live in a data-limited AGI timeline:
- Humans are trained using much less data than Chinchilla.
- We haven’t even begun to exploit forms of media other than text (YouTube alone is >2 OOM bigger).
- Self-play allows for literally limitless amounts of data.
- Regularization methods mean data constraints aren’t nearly as important as claimed.
- In the domains where we have exhausted the available data, ML models are already weakly superhuman.
Since I recently wrote an article endorsing Factorization as an alignment approach, I feel like I should respond here.
Everyone who proposes Factorization agrees there is a tradeoff between factorization and efficiency. The question is, how bad is that tradeoff?
Factorization is not a solution to the problem of general intelligence. However, there are a lot of problems that we should reasonably expect can be factorized.
Each human having 5 minutes with a Google Doc does not seem like a good way to factorize problems.
John seems wrongly pessimistic about the “Extremely Long Jury Trial”. We know from math that “you prove something, I check your work” is an extremely powerful framework. I would expect this to be true in real life as well.
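As a toy illustration of that asymmetry (the specific numbers below are just an example), verifying a claimed factorization takes a single multiplication even though finding the factors is hard:

```python
# "You prove something, I check your work": checking a claimed solution is
# often vastly cheaper than finding it. Factoring is the classic example.
p = 2**61 - 1   # a Mersenne prime
q = 2**89 - 1   # another Mersenne prime
N = p * q       # recovering p and q from N alone would be hard work

def check_claimed_factorization(N: int, p: int, q: int) -> bool:
    """The 'jury' side: one multiplication and two sanity checks."""
    return p > 1 and q > 1 and p * q == N

print(check_claimed_factorization(N, p, q))  # True
```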
I’m not quite sure what you mean here.
In the standard picture of a reinforcement learner, suppose you get to specify the reward function and I get to specify the “agent”. No matter what reward function you choose, I claim I can make an agent that both: 1) gets a huge reward compared to some baseline implementation, and 2) destroys the world. In fact, I think most “superintelligent” systems have this property for any reward function you could specify using current ML techniques.
Now switch the order: I design the agent first and ask you for an arbitrary reward function. I claim that there exist architectures which: 1) are useful, given the correct reward function, and 2) will never, under any circumstances, destroy the world.
What loss function(s), when sent into a future AI’s brain-like configuration of neocortex / hippocampus / striatum / etc.-like learning algorithms, will result in an AGI that is definitely not trying to literally exterminate humanity?
Specifying a correct loss function is not the right way to think about the Alignment Problem. A system’s architecture matters much more than its loss function for determining whether or not it is dangerous. In fact, there probably isn’t even a well-defined loss function that would remain aligned under infinite optimization pressure.
Like, I see this plan as basically saying “yep, that hard problem is in fact too hard, let’s try to dodge it, by having humans + narrow AI services perform the pivotal act”. Setting aside how I don’t particularly expect this to work, we can at least hopefully agree that it’s attempting to route around the problems that seem to me to be central, rather than attempting to solve them.
I think you’re being overly critical of this approach. We can build pretty useful AI without getting anywhere near your “sharp left turn”. For example, the “Strawberry problem” is well within the domain of things narrow AI could solve.
Importantly, if the Alignment Problem turns out to be ill-defined/unsolvable, combining a bunch of narrow AIs into something useful is the only workable approach we’ve got.
Like, yes, some directions are easier to point an AI in, on account of the direction itself being simpler to conceptualize, but that observation alone doesn’t say anything about how to determine which direction an AI is pointing after it falls into the capabilities well.
Identifying which concepts are attractor basins and figuring out how to build a useful AGI out of those concepts sounds like a complete solution to the Alignment Problem. Later in the article you seem positive about ELK, so maybe we agree on this fact.
Listening to the context there, it sounds like what Ben is saying is that once we’ve solved the alignment problem, we will eventually trust the aligned AI to make decisions we don’t understand. That is a very different claim from saying that an AI is trustworthy merely because it is intelligent and hasn’t done anything harmful so far.
I also don’t fully understand why he thinks it will be possible to use formal proof to align human-level AI, but not superhuman AI. He suggests there is a counting argument, but it seems that if I could write a formal proof of “won’t murder all humans” that works on a human-level AGI, that proof would be equally valid for a superhuman AGI. The difficulty is that formal mathematical proof doesn’t really work for fuzzily defined words like “human” and “murder”, not that superintelligence would transform those concepts (assuming they did have a clean mathematical representation). This is why I’m pessimistic about formal proof as an alignment strategy in general.
In fact, if it turned out that human value had a simple-to-define core, then the Alignment problem would be much easier than most experts expect.
I’m not personally on board with
2) we expect to extend more trust to the AI even as capabilities increase
The more capable an AI is, the more paranoid we should be about it. GPT-2 was harmless enough that you could basically give it to anyone who wanted it. GPT-3 isn’t “dangerous”, but you should at least be making sure it isn’t being used for mass misinformation campaigns or something like that. Assuming GPT-4 is human-level, it should be boxed/airgapped and only used by professionals with a clear plan to make sure it doesn’t produce dangerous outputs. And if GPT-5 is superintelligent (> all humans combined), even a text terminal is probably too dangerous until we’ve solved the alignment problem. The only use case in which I would even consider using an unaligned GPT-5 is if you could produce a formal proof that its outputs were what you wanted.
I’m curious to know if you expect explainability to increase in correlation with capability? i.e. or can we use Ben’s analogy that ‘I expect my dog to trust me, both bc I’m that much smarter, and I have a track-record of providing food/water for him’ ?
Don’t agree with this at all. Explainability/alignment/trustworthiness are all pretty much orthogonal to intelligence.
The claim that AI capabilities research is bad for Alignment is non-obvious.
In particular, if you think of AI-Alignment as a race between knowledge and hardware (more knowledge makes aligning AI easier, more hardware makes building dangerous AI easier), then AI research (that isn’t purely building faster hardware) is net-positive.
Mostly just here to say “I agree”, especially regarding
Similarly, I think Eliezer’s reasoning about convergent incentives and the deep nature of consequentialism is too sloppy to get to correct conclusions and the resulting assertions are wildly overconfident.
I think that if you really dive into any of these key points you will quickly reach details where Eliezer cannot easily defend his view to a smart disinterested audience.
A lot of EY’s points follow naturally if you think that the first AGI will be a recursively self-improving, maximally Bayesian reinforcement learner that fooms into existence as soon as someone invents the right metaheuristic. In this world we should be really worried about whether e.g. corrigibility is natural in some platonic sense, or whether there is a small core to human alignment.
In Paul’s world, AGI is the result of normal engineering, just at a scale roughly 1000x what OpenAI and DeepMind are doing now. In this world, it makes sense to talk about building large coalitions and really understanding what’s going on in the guts of existing Deep Learning algorithms.
I think Paul’s timelines (~15% on singularity by 2030 and ~40% on singularity by 2040) are a little conservative. Personally I estimate >50% by 2030, but Paul’s story of how AGI gets built makes a lot more sense than EY’s. And this goes a long way toward explaining why I think the world is less doomed than EY believes, and why we should focus less than he does on a small team of people performing a Pivotal Act.
I am on team “Objection 1.”
While, in theory, ERM will result in a model utilizing spurious correlations to get lower training loss, in practice this won’t be a big issue.
Specifically, dropout layers in neural networks are a form of regularization that should eliminate spurious correlations. Will there be a slightly improved version of regularization before we reach AGI? Probably. Does this seriously affect the credibility of the Scaling Hypothesis? No.
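For concreteness, this is the kind of thing I mean (a minimal PyTorch sketch; the layer sizes and dropout rate are arbitrary):

```python
import torch.nn as nn

# Minimal sketch: dropout randomly zeroes activations during training, so the
# network can't lean too heavily on any single (possibly spurious) feature.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # regularization: drop half the hidden units each step
    nn.Linear(256, 10),
)

model.train()  # dropout active: features are randomly masked each forward pass
model.eval()   # dropout disabled at inference time
```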
Worth pointing out that although go-anywhere/Level 5 self-driving isn’t a solved problem, computers are already more reliable than humans in the domain of highway driving.
I do agree with the general point that AI fails (often in bizarre ways) at a rate far too high to be acceptable for most safety-critical tasks.
I don’t think IDA is identical to the BoAI approach. One important point is that there are lots of problems we care about that we can solve without ever building a single superhuman AGI. BoAI is also explicitly not HCH, as the agents are human-level but not human-like.
I’d also like to see more game theory. IDA seems to just sort of assume all of the humans are correct and benevolent.
Same general idea, but with more checks-and-balances. 10^17 is almost certainly too large, for example.
I’m considering building some toy versions literally using GPT just to get a feel for how systems like this behave in the real world.
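Roughly the kind of toy I have in mind (a sketch only: `ask_gpt` is a stand-in for whatever completion API is available, and the decomposition prompts are made up):

```python
def ask_gpt(prompt: str) -> str:
    """Stand-in for a call to a GPT completion endpoint; swap in a real client here."""
    raise NotImplementedError

def answer_by_factorization(question: str, depth: int = 2) -> str:
    """Toy factored cognition: decompose the question, answer the pieces, recombine."""
    if depth == 0:
        return ask_gpt(f"Answer directly and concisely: {question}")
    subquestions = ask_gpt(
        f"List, one per line, the subquestions needed to answer: {question}"
    ).splitlines()
    subanswers = [
        answer_by_factorization(q.strip(), depth - 1) for q in subquestions if q.strip()
    ]
    findings = "\n".join(subanswers)
    return ask_gpt(f"Given these findings:\n{findings}\n\nAnswer: {question}")
```

The behavior I’d actually want to observe is how answers degrade (or don’t) as the decomposition gets deeper.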