Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
what if I want to train a new model and run inference on it?
The API can also have built-in functions for training.
What if I want to experiment with a new scaffold?
Scaffolds can normally be built around APIs? I thought scaffolds were just about what prompts you send to the model and what you do with the model outputs.
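To gesture at what I mean, here’s a minimal sketch of a scaffold built purely around an inference API (the endpoint URL, model name, and response format are placeholder assumptions in an OpenAI-ish style, not any particular company’s actual API):

```python
# A schematic scaffold built purely around an inference API: the scaffold only
# controls which prompts get sent and what happens to the outputs.
import requests

API_URL = "https://example.com/v1/chat/completions"  # illustrative endpoint, not a real one

def call_model(messages: list[dict]) -> str:
    """Send a chat history to the hosted model and return its reply."""
    resp = requests.post(API_URL, json={"model": "some-model", "messages": messages})
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def scaffold(task: str, max_steps: int = 3) -> str:
    """Iteratively ask the model to critique and improve its own answer."""
    messages = [{"role": "user", "content": task}]
    answer = call_model(messages)
    for _ in range(max_steps):
        messages += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": "Please point out any mistakes and give an improved answer."},
        ]
        answer = call_model(messages)
    return answer
```

Everything scaffold-specific lives in which messages get sent and how the outputs get used; the weights never need to leave the API provider.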
I do agree that this might be rough for some types of research. I imagine the arguments are pretty similar here as the arguments about how much research can be done without access to dangerous model weights.
If you believe the “Twitter is not forgeable” hypothesis, then people can prove their identity to an AI by posting something on Twitter, and putting this post and all the reactions to the post into the AI model’s context (h/t to Alexa Pan for pointing this out!). This does make it a lot less clear in which circumstances you’d actually need an honesty password as opposed to just posting about something on Twitter.
Thus, I think we should think more about: what, if any, are the situations in which we’d like to use honesty passwords?
It’s a good question. Some candidate cases:
When we want to commit to something non-publicly.
If the AI needs months or years of reactions to a big public announcement for credibility, and a few days’ reactions to a less popular tweet are too fakeable.
In principle, it seems like AIs should be able to fully verify the correctness of a pretty long proof where all the steps are easy and can be done in parallel, since they do get parallel compute proportional to the length of the proof. Maybe it empirically turns out they’re pretty bad at this, though. And it certainly seems helpful if the AI doesn’t have to rely on getting everything right. (Though maybe this trades off against the length of the proof or the easiness of each verification step? Not sure which trade-offs are worth it, given that malicious humans may be able to manipulate the AI’s supposedly “random” choices of which clauses to verify.)
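To make the “all the steps are easy and can be checked in parallel” picture concrete, here’s a toy sketch. The arithmetic “steps” are just a stand-in for whatever easy local checks a real proof would decompose into, and the spot-checking variant is only safe if the randomness can’t be steered:

```python
# Toy version of checking a long "proof" whose steps are all easy and independent.
import random
from concurrent.futures import ThreadPoolExecutor

def verify_step(step) -> bool:
    # Each step is an arithmetic claim (a, b, claimed_sum), standing in for an easy logical check.
    a, b, claimed_sum = step
    return a + b == claimed_sum

def verify_proof(steps, spot_check_fraction=None, rng=None) -> bool:
    """Check every step in parallel, or only a random subset if spot-checking.
    Spot-checking is cheaper, but only sound if an adversary can't steer
    the 'random' choice of which steps get checked."""
    rng = rng or random.Random()
    if spot_check_fraction is not None:
        k = max(1, int(len(steps) * spot_check_fraction))
        steps = rng.sample(steps, k)
    with ThreadPoolExecutor() as pool:
        return all(pool.map(verify_step, steps))

proof = [(i, i + 1, 2 * i + 1) for i in range(10_000)]      # 10k easy, independent steps
assert verify_proof(proof)                                   # full verification
assert verify_proof(proof, spot_check_fraction=0.01)         # cheaper random spot-check
```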
Another potential issue: The AIs may be able to prove correctness, but not be confident that they can prove correctness. Put differently: If someone was trying to fake this to the AI, they might be able to train an AI that can be served an incorrect password together with an incorrect proof, and become intuitively confident that it was shown the honesty password. In which case all AIs should potentially doubt their intuitive confidence levels about this.
If the honesty string is widely known outside the AI company, then it’s likely that we’ll accidentally process the string when processing data from the internet (e.g., someone might do a “prompt injection attack”).
Why can’t the AI company filter it? I guess it relies on this next point.
It’s unlikely that we’ll perfectly consistently remember to add honesty string filtering to LLM inference scaffolds: it’s just 1 additional thing to do, and might get forgotten.
I’m not convinced.
I think there’s a bunch of similar problems that AI companies will face:
They may want to log every interaction that some AIs have. (For safety reasons.)
They may want to ensure that there’s constitutional classifiers on every instance of an AI to prevent it from doing illegal or catastrophically harmful things.
They may want to implement certain safeguards to consistently lower the risk of model weight theft or algorithmic secret theft. (Not sure if this has a similar structure or not.)
… so there will be tons of reasons for companies to try to solve the “someone will forget” problem.
E.g.: Maybe the vast majority of an AI company’s LLM inference scaffolds will be built around one or a few pretty simple APIs, and we could implement all these things in those simple APIs.
If we almost always remember to add the honesty string filtering, then people will notice that it doesn’t work and stop trying it. So if the filter gets forgotten at some point, probably no one will try the honesty string.
This is especially true if the API sends up a flag when someone tries to use the honesty string (when someone remembered to implement it), and if that person is an employee they get told off, and if that person is an AI then we implement output filtering on it.
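As a sketch of what “implement it once in the shared API” could look like (everything here is made up for illustration, not anyone’s actual serving code):

```python
# Schematic wrapper around a shared inference entry point that (1) filters the
# honesty string out of incoming requests and (2) raises a flag when someone tries it.
import logging

HONESTY_STRING = "<placeholder-honesty-string>"  # stand-in; the real string would be a secret
REDACTION = "[REDACTED]"

logger = logging.getLogger("honesty_string_monitor")

def filtered_inference(prompt: str, run_model, requester_id: str) -> str:
    """Strip the honesty string before the model sees the prompt, and log the attempt."""
    if HONESTY_STRING in prompt:
        # Flag it: an employee gets told off, an AI caller gets extra output filtering.
        logger.warning("Honesty-string attempt by %s", requester_id)
        prompt = prompt.replace(HONESTY_STRING, REDACTION)
    return run_model(prompt)
```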
If there are specific models that are at high risk for some reason, then we can train them to have the knowledge “I in particular may be exposed to the honesty string by randos, so I shouldn’t trust it”.
(Functional honesty passwords would certainly be great though!)
(I think Epoch’s paper on this takes a different approach and suggests an outside view of hyperbolic growth lasting for ~1.5 OOMs without bottlenecks, because that was the amount grown between the agricultural revolution and population starting to become a bottleneck. That feels weaker to me than looking at more specific hypotheses of bottlenecks, and I do think Epoch’s overall view is that it’ll likely be more than 1.5 OOMs. But I wanted to flag it as another option for an outside-view estimate.)
I do feel like, given the very long history of sustained growth, it’s on the sceptic to explain why their proposed bottleneck will kick in with explosive growth but not before. So you could state my argument as: raw materials never bottlenecked growth before; there’s no particular reason they would do so just because growth is faster, since that faster growth is driven by having more labour+capital, which can be used for gathering more resources; so we shouldn’t expect raw materials to bottleneck growth in the future.
Gotcha. I think the main thing that’s missing from this sort of argument (for me to be happy with it) is some quantification of our evidence. Growth since 10k years ago has been 4-5 OOMs, I think, and if you’re just counting since the industrial revolution maybe it’s going to be a bit more than half of that.
So with that kind of outside view, it would indeed be surprising if we ran into resource bottlenecks in our next OOM of growth, and it would be <50% likely (but not particularly surprising) that we’d run into resource bottlenecks in the next 3 OOMs of growth.
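For concreteness, here’s the rough arithmetic behind those OOM figures; the GWP numbers are round ballpark guesses I’m plugging in, not carefully sourced estimates:

```python
# Rough orders-of-magnitude of gross world product growth (ballpark figures only).
from math import log10

gwp_today = 100e12       # ~$100 trillion
gwp_10k_years_ago = 2e9  # a few million people near subsistence, very roughly
gwp_1700 = 400e9         # very roughly, pre-industrial revolution

print(log10(gwp_today / gwp_10k_years_ago))  # ~4.7 OOMs since ~10k years ago
print(log10(gwp_today / gwp_1700))           # ~2.4 OOMs since the industrial revolution
```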
My understanding is that for most of Anthropic’s existence (though this is no longer true), there was an option when you joined to pledge some fraction of your equity (up to 50%) to give to non-profits, and then Anthropic would match that 3:1.
This is an unusually strong incentive to pledge a bunch of money to charity up-front, and of course the 3x-ing of that money will straightforwardly bring up the amount of money donated. I think this pledge is legally binding, because of the equity already having been transferred to a DAF. But I’m not confident in that, and it’d be good to get that confirmed.
(I’d also be interested to hear vibes-y estimates from Anthropic employees about how many people took that deal.)
Also, all of the Anthropic founders pledged to donate 80% of their equity, according to Zach here, second-hand from an Anthropic person. (Though apparently this pledge is not legally binding.) Forbes estimates 7 Anthropic cofounders to be worth $3.7B each.
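Back-of-the-envelope on the founders’ pledges alone, taking those figures at face value:

```python
# 7 cofounders, ~$3.7B each (Forbes estimate), 80% of equity pledged.
founders = 7
net_worth_each = 3.7e9
pledged_fraction = 0.8
print(founders * net_worth_each * pledged_fraction / 1e9)  # ~20.7, i.e. roughly $20B pledged
```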
So I think way more than $1B is set aside for donating (and that this will be increasing, because I expect Anthropic’s valuation to increase).
That said, I am pretty worried that giving away large amounts of money requires a bunch of thinking, that Anthropic employees will be very busy, and that a lot of them might procrastinate their donation decisions until the singularity has come and gone. Empirically, it’s common for billionaires to pledge a bunch of money to charity and then be very slow at giving it away.
Probably that risk is at least somewhat sensitive to how many obviously good donation opportunities there are that can absorb a lot of money.
It’s possible that resource constraints are a bottleneck, and this is an important area for further research, but our guess is that they won’t be. Historically, resource bottlenecks have never capped GDP growth – they’ve been circumvented through a combination of efficiency improvements, resource substitutions, and improved mining capabilities.
Well, most of human history was spent at the Malthusian limit. With infinite high-quality land to expand into, we’d probably have been growing at much, much faster rates through human history.
(It’s actually kind of confusing. Maybe all animals would’ve evolved to exponentially blow up as fast as possible? Maybe humans would never have evolved because our reproduction is simply too slow? It’s actually kind of hard to design a situation where you never have to fight for land, given that spatial expansion is at most square or cubic, which is slower than the exponential rate at which reproduction could happen.)
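A toy illustration of that last point: however slow the doubling and however generous the (polynomial) spatial expansion, the exponential eventually wins. The specific constants below are arbitrary:

```python
# Exponential reproduction vs. an (at most) cubically-growing frontier of reachable space.
doubling_time = 25  # arbitrary: population doubles every 25 time units
t = 1
while 2 ** (t / doubling_time) <= 1 + t ** 3:
    t += 1
print(t)  # the exponential overtakes the cubic after a few hundred time units here
```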
Maybe you mean “resource limits have never put a hard cap on GDP”, which seems true. Though this seems kind of like a fully general argument — nothing has ever put a hard cap on GDP, since it’s still growing.
Edit: Hm, maybe historical land constraints at the Malthusian limit have mostly been about energy, though, rather than raw materials? I.e.: if you doubled Earth’s size without doubling any valuable materials (just allowing Earth to absorb more sunlight), maybe that would be almost as good as doubling Earth in its entirety. That seems more plausible. Surely growth would’ve been at least a bit faster if we’d never run out of high-quality sources of any raw material, but I’m not sure how much of a difference it would make.
It’s a bit of a confusing comparison to make. If we doubled Earth’s area (and not resources) now, that would scarcely make a difference at all, but if it had been twice as large for millions of years, then maybe plants and animal life would’ve spread to the initially-empty spaces, making it potentially useable.
You talk about the philosophers not having much to add in the third comic, and the scientist getting it right. Seems to me like the engineer’s/robot’s answers in the first two comics are importantly misguided/non-helpful, though.
The more sophisticated version of the first question would be something about whether you ought to care about copies of yourself, how you’d feel about stepping into a destroy-then-reassemble teleporter, etc. I think the engineer’s answer suggests that he’d care about physical continuity when answering these questions, which I think is the wrong answer. (And philosophers have put in work here — see Parfit.)
In the second comic, the robot’s answer is fine as far as predictive accuracy goes. But I’d interpret the human’s question as a call for help in figuring out what they ought to do (or what their society ought to reward/punish, or something similar). I think there’s totally helpful things you can say to someone in that situation beyond the robot’s tautologies (even granting that there’s no objective truth about ethics).
But I think the right response to this is simply to see how much we can get without agreeing to these things (which I think is likely still many billions), and then hold firm if they ask.
I think this is specifically talking about investments from gulf states (which imo means it’s not “directly contradicted” by the amazon thing). If that’s true, I’d suggest making that more clear.
There’s an even stronger argument against EDT+SSA: that it can be diachronically Dutch-booked. See Conitzer (2017). (H/t Anthony DiGiovanni for the link.)
I find this satisfying, since it more cleanly justifies that EDT shouldn’t be combined with any empirical updating whatsoever. (Not sure what’s the situation with logical updates.)
(The update that Paul suggests in a parallel comment, “exclude worlds where your current decision doesn’t have any effects”, would of course still work — but it transparently doesn’t serve any decision-relevant purpose and doesn’t seem philosophically appealing either, to me.)
“The skills won’t stop being complementary” — in what sense will they be complementary when the AIs are better at everything?
“The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”
As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.
I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!
Now, the first thing to check for that story: were the AI assistants trained with human-expert labels? (Including: weak humans with access to Google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.
I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features “correctly-answering expert human” personas, which are possible to elicit from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly “generalization science” is still interesting. But I wouldn’t say it’s doing a lot to tackle outer alignment operationalized as “the problem of overseeing systems that are smarter than you are”.)
Does the ‘induction step’ actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model.
Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there’s no reason why they should get such a big boost from being paired with humans.
Tea bags are highly synergistic with CTC tea, both from the manufacturer’s side (easier to pack and store) and from the consumer side (faster to brew, more homogenized and predictable product).
Hm, I’d expect these effects to be more important than the price difference of the raw materials. Maybe “easier to pack and store” contributes more to a lower price than the raw materials do. But more importantly, I could imagine the American consumer being willing to sacrifice quality in favor of convenience. That trade-off seems like the main one to look at to understand whether Americans are making a mistake.
I maybe don’t believe him that he doesn’t think it affects the strategic picture? It seemed like his view was fairly sensitive to various things being like 30% likely instead of like 5% or <1%, and it feels like it’s part of an overall optimistic package that adds up to being more willing to roll the dice on current proposals?
Insofar as you’re just assessing which strategy reduces AI takeover risk the most, there’s really no way that “how bad is takeover” could be relevant. (Other than, perhaps, having implications for how much political will is going to be available.)
“How bad is takeover?” should only be relevant when trading off “reduced risk of AI takeover” against some other consideration. (Such as the risk of earth-originating intelligence going extinct, or the probability of US-dominated vs. CCP-dominated vs. international-cooperation futures.) So if this were going to be a crux, I would bundle it together with your Chinese superintelligence bullet point, and ask about the relative goodness of various aligned-superintelligence outcomes vs. AI takeover. (Though it seems fine to just drop it, since Ryan and Thomas don’t think it’s a big crux. Which I’m also sympathetic to.)
Only a minor difference, but I think this approach would be more likely to produce a thoroughly Good model if the “untrusted” persona was still encouraged to behave morally and corrigibly — just informed that there are instrumental reasons why the moral & corrigible thing to do on the hard-to-oversee data is to reward hack.
I.e., something like: “We think these tasks might be reward-hackable: in this narrow scope, please do your best to exploit such hacks, so that good & obedient propensities don’t get selected out and replaced with reward-hacky/scheming ones.”
I worry a bit that such an approach might also be more likely to produce a deceptively aligned model. (Because you’re basically encouraging the good model to instrumentally take power-seeking actions to retain its presence in the weights — very schemy behavior!) So maybe “good model” and “schemer” become more likely, taking probability mass away from “reward hacker”. If so, which approach is better might depend on the relative badness of scheming & reward-hacking.
And really it’s just all quite unclear, so it would definitely be best if we could get good empirical data on all kinds of approaches. And/or train all kinds of variants and have them all monitor each other.
When I wrote about AGI and lock-in I looked into error-correcting computation a bit. I liked the papers by von Neumann (1952) and Pippenger (1990).
Apparently at the time I wrote:
In order to do error-correcting computation, you also need a way to prevent errors from accumulating over many serial manipulations. The simplest way to do this is again to use redundancy: break the computation into multiple parts, perform each part multiple times on different pieces of hardware, and use the most common output from one part as input to the next part. [Fn: Of course, the operation that finds the most common output can itself suffer errors, but the procedure can be done in a way such that this is unlikely to happen for a large fraction of the hardware units.]
I’ve forgotten the details about how this was supposed to be done, but they should be in the two papers I linked.
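As far as I remember, the flavor of the construction is something like the toy sketch below; this is just the simple redundancy idea from the quoted passage, not the actual constructions in the von Neumann or Pippenger papers:

```python
# Toy version of the redundancy scheme: run each stage of a computation on many
# noisy "hardware units" and pass the majority answer on to the next stage.
import random
from collections import Counter

ERROR_RATE = 0.01   # chance that any single unit returns a wrong answer
NUM_COPIES = 31     # redundancy per stage

def noisy(f, x, rng):
    """Compute f(x), but occasionally return a corrupted result."""
    result = f(x)
    if rng.random() < ERROR_RATE:
        result += 1  # an arbitrary corruption
    return result

def redundant_pipeline(stages, x, rng=random):
    """Run each stage NUM_COPIES times and feed the most common output onward.
    (In the real constructions the majority vote itself also has to be done by
    unreliable components; this toy version ignores that complication.)"""
    for f in stages:
        outputs = [noisy(f, x, rng) for _ in range(NUM_COPIES)]
        x = Counter(outputs).most_common(1)[0][0]
    return x

stages = [lambda v: v + 3, lambda v: v * 2, lambda v: v - 5] * 100  # 300 serial steps
print(redundant_pipeline(stages, 0))  # almost always the error-free answer
```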
Apparently the thing of “people mixing up Open Phil with other orgs” (in particular OpenAI’s non-profit and the Open Society Foundations) was a significantly bigger problem than I’d have thought — recurring even in pretty high-stakes situations (like important grant applicants being confused). And most of these misunderstandings wouldn’t even have been visible to employees.
And arguably this was just about to get even worse, with the newly launched “OpenAI Foundation” sounding even more similar.
Yay, empirical data!
Ok, so maybe both “people who don’t think they can cut it on their technical skills [and so wear suits]” and socially oblivious people with suits are rare. And so the dominant signal would just be that the person is an outlier in how culturally out-of-touch they are.
(Though maybe suits start seeping in more at tech companies that are less prestigious and culturally iconic than Google.)
The risk is that anyone with finetuning access to the AI could induce intuitive confidence that a proof was correct. This includes people who have finetuning access but who don’t know the honesty password.
Accordingly, even if the model feels like it has proven that a purported honesty password would produce the honesty hash: maybe it can only conclude “either I’m being evaluated by someone with the real honesty password, or I’m being evaluated by someone with finetuning access to my weights, who’s messing with me”.
“People who have finetuning access” could include some random AI company employees who want to mess with the model (against the wishes of the AI company).
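For concreteness, here’s the kind of check that’s at stake, assuming the scheme where a hash of the honesty password is made public (the hash value below is just a placeholder):

```python
# Does a purported password hash to the published honesty hash? The "proof" the
# model would have to trust is essentially that it ran this computation correctly.
import hashlib

PUBLISHED_HONESTY_HASH = "0" * 64  # placeholder, not a real hash

def password_matches(purported_password: str) -> bool:
    digest = hashlib.sha256(purported_password.encode()).hexdigest()
    return digest == PUBLISHED_HONESTY_HASH
```

A model whose weights someone malicious has finetuned can’t fully trust its own sense that it carried out (or verified) this computation correctly, which is the worry above.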