Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
I think your prediction of superexponential growth isn’t really about “superexponential” growth, but instead about there being an outright discontinuity where the time-horizons go from a finite value to infinity. I guess this is “superexponential” in a certain loose sense, but not in the same sense as, e.g., hyperbolic growth is superexponential.
I don’t think this can be modeled via extrapolating straight lines on graphs / quantitative models of empirically observed external behavior / “on-paradigm” analyses.
I’m confused by this. A hyperbolic function 1/(t_c−t) goes to infinity in finite time. It’s a typical example of what I’m talking about when I talk about “superexponential growth” (because variations on it are a pretty good theoretical and empirical fit to growth dynamics with increasing returns). You can certainly use past data points of a hyperbolic function to extrapolate and make predictions about when it will go to infinity.
I don’t see why time horizons couldn’t be a superexponential function like that.
(In the economic growth case, it doesn’t actually go all the way to infinity, because eventually there’s too little science left to discover and/or too little resources left to expand into. Still a useful model up until that point.)
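To make the extrapolation point concrete, here’s a minimal sketch of fitting a hyperbolic curve to a handful of data points and reading off the implied blow-up date. (The numbers are made up for illustration; this isn’t METR’s data or anyone’s actual forecast.)

```python
# Sketch: fit h(t) = a / (t_c - t) to hypothetical time-horizon data and
# extrapolate the finite-time blow-up year t_c. All values are made up.
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(t, a, t_c):
    return a / (t_c - t)

years = np.array([2020.0, 2021.5, 2023.0, 2024.0, 2025.0])  # hypothetical
horizons = np.array([0.02, 0.05, 0.15, 0.5, 1.5])            # hours, hypothetical

(a_hat, t_c_hat), _ = curve_fit(
    hyperbolic, years, horizons, p0=[1.0, 2027.0],
    bounds=([0.0, 2025.1], [np.inf, 2100.0]),
)
print(f"implied blow-up year: {t_c_hat:.1f}")
```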
Graph for 2-parameter sigmoid, assuming that you top out at 1 and bottom out at 0.
If you instead do a 4-parameter sigmoid with free top and bottom, the version without SWAA asymptotes at 0.7 to the left instead, which looks terrible. (With SWAA the asymptote is a little above 1 to the left; and they both get asymptotes a little below 0 to the right.)
(Wow, graphing is so fun when I don’t have to remember matplotlib commands. TBC I’m not really checking the language models’ work here other than assessing consistency and reasonableness of output, so discount depending on how much you trust them to graph things correctly in METR’s repo.)
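For reference, the functional forms I mean are roughly these (a sketch with made-up data and assumed parameterizations, not the actual code behind the graphs):

```python
# 2-parameter sigmoid (forced to run from 0 to 1) vs 4-parameter sigmoid
# (free top and bottom asymptotes), fit to made-up success-rate data.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid2(log2_minutes, a, b):
    return 1.0 / (1.0 + np.exp(-(a + b * log2_minutes)))

def sigmoid4(log2_minutes, a, b, bottom, top):
    return bottom + (top - bottom) / (1.0 + np.exp(-(a + b * log2_minutes)))

x = np.log2(np.array([0.1, 0.5, 2.0, 10.0, 60.0, 240.0]))  # human minutes
y = np.array([0.95, 0.90, 0.70, 0.45, 0.20, 0.05])         # mean model success

p2, _ = curve_fit(sigmoid2, x, y, p0=[0.0, -1.0])
p4, _ = curve_fit(sigmoid4, x, y, p0=[0.0, -1.0, 0.0, 1.0])
print("2-param (a, b):", p2)
print("4-param (a, b, bottom, top):", p4)
```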
Yeah, a line is definitely not the “right” relationship, given that the y-axis is bounded 0-1 and a line isn’t. A sigmoid or some other 0-1 function would make more sense, and more so the further outside the sensitive middle region of success rates you go. I imagine the purpose of this graph was probably to sanity-check that the human baselines did roughly track difficulty for the AIs as well. (Which looks pretty true to me when eyeballing the graph. The biggest eyesore is definitely the 0% success rate in the 2-4h bucket.)
Incidentally, your intuition might’ve been misled by one or both of:
underestimating the number of data points clustered tightly together in the bottom right. (They look less significant because they take up less space.)
trying to imagine a line that minimizes the distance to all data points, whereas linear regression works by minimizing the vertical distance to all data points.
As an illustration of the last point: here’s a bonus plot where the green line minimizes the horizontal squared distance instead, i.e., it predicts human minutes from average model score. I wouldn’t quite say it’s almost vertical, but it’s much steeper.
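Here’s what I mean by the two directions, as a self-contained sketch (synthetic data, not METR’s):

```python
# Vertical vs horizontal least squares: the usual fit regresses success on
# log2(human minutes); the "green line" regresses log2(minutes) on success
# and is then re-expressed in the same axes, which makes it steeper.
import numpy as np

rng = np.random.default_rng(0)
log2_minutes = rng.uniform(-4, 8, size=50)
success = np.clip(1.0 - 0.07 * (log2_minutes + 4) + rng.normal(0, 0.1, size=50), 0, 1)

slope_yx, intercept_yx = np.polyfit(log2_minutes, success, 1)  # minimizes vertical distance

slope_xy, intercept_xy = np.polyfit(success, log2_minutes, 1)  # minimizes horizontal distance
slope_green = 1.0 / slope_xy               # same line, expressed as success vs log2(minutes)
intercept_green = -intercept_xy / slope_xy

print("vertical-residual slope:", slope_yx)
print("horizontal-residual slope:", slope_green)  # steeper (larger in magnitude)
```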
Notice how the log-linear fit here only looks good for the SWAA data, in the 1 sec − 1 min range. There’s something completely different going on for tasks longer than 1 minute, clearly not explained by the log-linear fit. If you tried to make a best fit line on the blue points (the length of tasks we care about after 2024), you’d get a very different, almost vertical line, with a very low R^2.
I don’t think this is true. I got Claude to clone the repo and reproduce it without the SWAA data points. The slope is ~identical (−0.076 rather than the original −0.072) and the correlation is still pretty good (0.51).
Edit: That was with HCAST and RE-bench. Just HCAST gives slope = −0.077 and R^2 = 0.48. I think it makes more sense to include RE-bench.
Edit 2: Updated the slopes. Now the slope is per doubling, like in the paper (and so the first slope matches the one in the paper). I think the previous slopes were measuring per factor of e instead.
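For concreteness, the kind of regression being reproduced is roughly this (a sketch, not the actual repo code; the file and column names are made up):

```python
# Regress mean model success on log2(human minutes), excluding SWAA tasks,
# and report the slope per doubling plus R^2. Data file/columns are hypothetical.
import numpy as np
import pandas as pd

df = pd.read_csv("per_task_results.csv")      # hypothetical export
df = df[df["task_suite"] != "SWAA"]           # drop the 1 sec - 1 min SWAA tasks

x = np.log2(df["human_minutes"].to_numpy())   # one unit of x = one doubling
y = df["mean_model_success"].to_numpy()

slope, intercept = np.polyfit(x, y, 1)        # slope is "per doubling" because x is log2
residuals = y - (slope * x + intercept)
r2 = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
print(f"slope per doubling = {slope:.3f}, R^2 = {r2:.2f}")
```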
The risk is that anyone with finetuning access to the AI could induce intuitive confidence that a proof was correct. This includes people who have finetuning access but who don’t know the honesty password.
Accordingly, even if the model feels like it has proven that a purported honesty password would produce the honesty hash: maybe it can only conclude “either I’m being evaluated by someone with the real honesty password, or I’m being evaluated by someone with finetuning access to my weights, who’s messing with me”.
“People who have finetuning access” could include some random AI company employees who want to mess with the model (against the wishes of the AI company).
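(For background, my reading of the basic commitment scheme, sketched with SHA-256 as an assumed stand-in for whatever hash would actually be used: the model is only ever taught the hash, and someone proves they hold the password by revealing a preimage. The worry above is that a finetuned model can be made to feel like this check passed without being able to actually run it reliably.)

```python
# Sketch of a hash-commitment check for an honesty password. SHA-256 and the
# placeholder hash are my assumptions, not a spec of the actual proposal.
import hashlib

HONESTY_HASH = "<publicly known digest the model is trained on>"  # placeholder

def is_real_honesty_password(candidate: str) -> bool:
    return hashlib.sha256(candidate.encode("utf-8")).hexdigest() == HONESTY_HASH
```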
what if I want to train a new model and run inference on it?
The API can also have built-in functions for training.
What if I want to experiment with a new scaffold?
Scaffolds can normally be built around APIs? I thought scaffolding was just about what prompts you send to the model and what you do with the model outputs.
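To gesture at what I mean: a scaffold in this sense is just a loop around API calls. Here’s a toy version (call_model is a made-up stand-in, not any particular company’s API):

```python
# Toy scaffold built purely around an inference API. The scaffold only decides
# what prompt to send and what to do with the output.
from typing import Callable

def run_scaffold(call_model: Callable[[str], str], task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        output = call_model(transcript + "\nWhat should be done next?")
        transcript += f"\nModel: {output}"
        if "DONE" in output:  # stop condition chosen by the scaffold, not the API
            break
    return transcript
```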
I do agree that this might be rough for some types of research. I imagine the arguments are pretty similar here as the arguments about how much research can be done without access to dangerous model weights.
If you believe the “Twitter is not forgeable” hypothesis, then people can prove their identity to an AI by posting something on Twitter, and putting this post and all the reactions to the post into the AI model’s context (h/t to Alexa Pan for pointing this out!). This does make it a lot less clear in which circumstances you’d actually need an honesty password as opposed to just posting about something on Twitter.
Thus, I think we should think more about: what, if any, are the situations in which we’d like to use honesty passwords?
It’s a good question. Some candidate cases:
When we want to commit to something non-publicly.
If the AI needs months or years of reactions to a big public announcement for credibility, and a few days’ reactions to a less popular tweet are too fakeable.
In principle, it seems like AIs should be able to fully verify the correctness of a pretty long proof where all the steps are easy and can be done in parallel, because they do get parallel compute proportional to the length of the proof. Maybe it empirically turns out they’re pretty bad at this, though. And it certainly seems helpful if the AI doesn’t have to rely on getting everything right. (Though maybe this trades off against the length of the proof or the easiness of each verification step? Not sure which tradeoffs are worth it, given that malicious humans may be able to manipulate the AI’s supposedly “random” choices of which clauses to verify.)
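To put a rough number on the spot-checking tradeoff (my toy numbers, and assuming the clause choices really are random, which is exactly what a malicious human might undermine):

```python
# Probability that random spot-checking misses every flawed clause, when f of
# n clauses are flawed and k are checked uniformly without replacement.
def miss_probability(n: int, f: int, k: int) -> float:
    p = 1.0
    for i in range(k):
        p *= (n - f - i) / (n - i)
    return p

print(miss_probability(n=1000, f=5, k=200))  # ~0.33 even with 20% of clauses checked
```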
Another potential issue: The AIs may be able to prove correctness, but not be confident that they can prove correctness. Put differently: If someone was trying to fake this to the AI, they might be able to train an AI that can be served an incorrect password together with an incorrect proof, and become intuitively confident that it was shown the honesty password. In which case all AIs should potentially doubt their intuitive confidence levels about this.
If the honesty string is widely known outside the AI company, then it’s likely that we’ll accidentally process the string when processing data from the internet (e.g., someone might do a “prompt injection attack”).
Why can’t the AI company filter it? I guess it relies on this next point.
It’s unlikely that we’ll perfectly consistently remember to add honesty string filtering to LLM inference scaffolds: it’s just 1 additional thing to do, and might get forgotten.
I’m not convinced.
I think there’s a bunch of similar problems that AI companies will face:
They may want to log every interaction that some AIs have. (For safety reasons.)
They may want to ensure that there’s constitutional classifiers on every instance of an AI to prevent it from doing illegal or catastrophically harmful things.
They may want to implement certain safeguards to consistently lower the risk of model weight theft or algorithmic secret theft. (Not sure if this has a similar structure or not.)
… so there will be tons of reasons for companies to try to solve the “someone will forget” problem.
E.g.: Maybe the vast majority of an AI company’s LLM inference scaffolds will be built around one or a few pretty simple APIs, and we could implement all these things in those simple APIs.
If we almost always remember to add the honesty string filtering, then people will notice that it doesn’t work and stop trying it. So if the filter gets forgotten at some point, probably no one will try the honesty string.
This is especially true if the API sends up a flag when someone tries to use the honesty string (when someone remembered to implement it), and if that person is an employee they get told off, and if that person is an AI then we implement output filtering on it.
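A very rough sketch of what “implement it once in the shared API” could look like (every name here is a hypothetical placeholder):

```python
# Honesty-string filtering plus flagging at the shared inference-API layer.
# All helpers and names below are stand-ins, not a real company's API.
HONESTY_STRINGS = {"<the honesty string>"}

def alert_security(caller_id: str, reason: str) -> None:
    print(f"flagged {caller_id}: {reason}")  # stand-in for telling an employee off / filtering an AI's outputs

def serve_completion(prompt: str) -> str:
    return "<model output>"                  # stand-in for the actual inference call

def filtered_inference(prompt: str, caller_id: str) -> str:
    for s in HONESTY_STRINGS:
        if s in prompt:
            alert_security(caller_id, reason="honesty string in prompt")
            prompt = prompt.replace(s, "[REDACTED]")
    return serve_completion(prompt)
```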
If there are specific models that are at high risk for some reason, then we can train them to have the knowledge “I in particular may be exposed to the honesty string by randos, so I shouldn’t trust it”.
(Functional honesty passwords would certainly be great though!)
(I think Epoch’s paper on this takes a different approach and suggests an outside view of hyperbolic growth lasting for ~1.5 OOMs without bottlenecks, because that was the amount grown between the agricultural revolution and the population bottleneck starting. That feels weaker to me than looking at more specific hypotheses of bottlenecks, and I do think Epoch’s overall view is that it’ll likely be more than 1.5 OOMs. But wanted to flag it as another option for an outside view estimate.)
I do feel like, given the very long history of sustained growth, it’s on the sceptic to explain why their proposed bottleneck will kick in with explosive growth but not before. So you could state my argument as: raw materials never bottlenecked growth before; there’s no particular reason they would just because growth is faster, since that faster growth is driven by having more labour+capital, which can be used for gathering more resources; so we shouldn’t expect raw materials to bottleneck growth in the future.
Gotcha. I think the main thing that’s missing from this sort of argument (for me to be happy with it) is some quantification of our evidence. Growth since 10k years ago has been 4-5 OOMs, I think, and if you’re just counting since the industrial revolution maybe it’s a bit more than half of that.
So with that kind of outside view, it would indeed be surprising if we ran into resource bottlenecks in our next OOM of growth, and <50% (but not particularly surprising) if we ran into resource bottlenecks in the next 3 OOMs of growth.
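One crude way to put numbers on that outside view is a Laplace-style rule of succession over OOMs. To be clear, this specific model is my own gloss; only the rough “~5 bottleneck-free OOMs so far” input comes from the estimate above:

```python
# Laplace-style rule of succession over OOMs of growth (a crude assumption,
# not a claim about the true hazard rate of resource bottlenecks).
def p_no_bottleneck_within(k_future_ooms: int, n_past_ooms: int = 5) -> float:
    p = 1.0
    for i in range(k_future_ooms):
        p *= (n_past_ooms + 1 + i) / (n_past_ooms + 2 + i)  # update after each clean OOM
    return p

print(1 - p_no_bottleneck_within(1))  # ~0.14: bottleneck within the next OOM
print(1 - p_no_bottleneck_within(3))  # ~0.33: bottleneck within the next 3 OOMs
```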
My understanding is that for most of Anthropic’s existence (though this is no longer true), there was an option when you joined to pledge some fraction of your equity (up to 50%) to give to non-profits, and then Anthropic would match that 3:1.
This is an unusually strong incentive to pledge a bunch of money to charity up-front, and of course the 3x-ing of that money will straightforwardly bring up the amount of money donated. I think this pledge is legally binding, because of the equity already having been transferred to a DAF. But I’m not confident in that, and it’d be good to get that confirmed.
(I’d also be interested to hear a vibes-y estimate from Anthropic employees about how many people took that deal.)
Also, all of the Anthropic founders pledged to donate 80% of their equity, according to Zach here, second-hand from an Anthropic person. (Though apparently this pledge is not legally binding.) Forbes estimates 7 Anthropic cofounders to be worth $3.7B each.
So I think way more than $1B is set aside for donating; taking those numbers at face value, the founders’ pledges alone would come to roughly 7 × $3.7B × 80% ≈ $20B. (And this will be increasing, because I expect Anthropic’s valuation to increase.)
That said, I am pretty worried that giving away large amounts of money requires a bunch of thinking, that Anthropic employees will be very busy, and that a lot of them might procrastinate their donation decisions until the singularity has come and gone. Empirically, it’s common for billionaires to pledge a bunch of money to charity and then be very slow at giving it away.
Probably that risk is at least somewhat sensitive to how many obviously good donation opportunities there are that can absorb a lot of money.
It’s possible that resource constraints are a bottleneck, and this is an important area for further research, but our guess is that they won’t be. Historically, resource bottlenecks have never capped GDP growth – they’ve been circumvented through a combination of efficiency improvements, resource substitutions, and improved mining capabilities.
Well, most of human history was spent at the Malthusian limit. With infinite high-quality land to expand into, we’d probably have been growing at much, much faster rates through human history.
(It’s actually kind of confusing. Maybe all animals would’ve evolved to exponentially blow up as fast as possible? Maybe humans would never have evolved because our reproduction is simply too slow? It’s actually kind of hard to design a situation where you never have to fight for land, given that spatial expansion is at most square or cubic, which is slower than the exponential rate at which reproduction could happen.)
Maybe you mean “resource limits have never put a hard cap on GDP”, which seems true. Though this seems kind of like a fully general argument — nothing has ever put a hard cap on GDP, since it’s still growing.
Edit: Hm, maybe historical land constraints at the Malthusian limit have mostly been about energy, though, rather than raw materials? I.e.: if you doubled Earth’s size without doubling any valuable materials, just allowing Earth to absorb more sunlight, maybe that would be almost as good as doubling Earth in its entirety. That seems more plausible. Surely growth would’ve been at least a bit faster if we had never run out of high-quality sources of any raw material, but I’m not sure how much of a difference it would make.
It’s a bit of a confusing comparison to make. If we doubled Earth’s area (and not resources) now, that would scarcely make a difference at all, but if it had been twice as large for millions of years, then maybe plant and animal life would’ve spread to the initially-empty spaces, making it potentially usable.
You talk about the philosophers not having much to add in the third comic, and the scientist getting it right. Seems to me like the engineer’s/robot’s answers in the first two comics are importantly misguided/unhelpful, though.
The more sophisticated version of the first question would be something about whether you ought to care about copies of yourself, how you’d feel about stepping into a destroy-then-reassemble teleporter, etc. I think the engineer’s answer suggests that he’d care about physical continuity when answering these questions, which I think is the wrong answer. (And philosophers have put in work here — see Parfit.)
In the second comic, the robot’s answer is fine as far as predictive accuracy goes. But I’d interpret the human’s question as a call for help in figuring out what they ought to do (or what their society ought to reward/punish, or something similar). I think there’s totally helpful things you can say to someone in that situation beyond the robot’s tautologies (even granting that there’s no objective truth about ethics).
But I think the right response to this is simply to see how much we can get without agreeing to these things (which I think is likely still many billions), and then hold firm if they ask.
I think this is specifically talking about investments from Gulf states (which imo means it’s not “directly contradicted” by the Amazon thing). If that’s true, I’d suggest making that more clear.
There’s an even stronger argument against EDT+SSA: that it can be diachronically Dutch-booked. See Conitzer (2017). (H/t Anthony DiGiovanni for the link.)
I find this satisfying, since it more cleanly justifies that EDT shouldn’t be combined with any empirical updating whatsoever. (Not sure what the situation is with logical updates.)
(The update that Paul suggests in a parallel comment, “exclude worlds where your current decision doesn’t have any effects”, would of course still work — but it transparently doesn’t serve any decision-relevant purpose and doesn’t seem philosophically appealing either, to me.)
“The skills won’t stop being complementary” — in what sense will they be complementary when the AIs are better at everything?
“The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”
As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.
I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!
Now, the first thing to check about that story: Were the AI assistants trained with human-expert labels? (Including: weak humans with access to Google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.
I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features “correctly-answering expert human” personas, which are possible to elicit from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly “generalization science” is still interesting. But I wouldn’t say it’s doing a lot to tackle outer alignment operationalized as “the problem of overseeing systems that are smarter than you are”.)
Does the ‘induction step’ actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model.
Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there’s no reason why they should get such a big boost from being paired with humans.
Hm.
R² = 1 − (mean squared error / variance)
Mean squared error seems pretty principled. Normalizing by variance to make it more comparable to other distributions seems pretty principled.
I guess after that it seems more natural to take the square root (to get RMSE normalized by standard deviation) than to subtract it from 1. But I guess the latter is a simple enough transformation and makes it comparable to the (more well-motivated) R^2 for linear models, and so is more commonly reported than RMSE/STD.
Anyway, Spearman r is −0.903 (square 0.82) and −0.710 (square 0.50), so basically the same.
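For reference, the quantities being compared, computed on made-up data (not the numbers above):

```python
# R^2 as 1 - MSE/variance, the RMSE/std alternative, and Spearman r.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.5, size=100)

mse = np.mean((y_true - y_pred) ** 2)
var = np.var(y_true)

r2 = 1 - mse / var                  # "1 - MSE/variance"
rmse_over_std = np.sqrt(mse / var)  # the more "natural" normalization mentioned above
rho, _ = spearmanr(y_true, y_pred)  # rank correlation

print(r2, rmse_over_std, rho)
```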