Appreciated this post.
Doesn’t the recomputation approach require me to share my model weights with my adversary? Seems like a tough ask.
(is this post written with/by AI?)
Agree, and would add: what even is the definition of the term “schemer”? I think Joe Carlsmith’s 2023 report coined the term and defined it as something quite specific, with reference to training gaming.
I think people often use it now just in the colloquial sense. I’m not generically against colloquial usage, but I think clarity is often very important for deciding the right interventions. “Is my girlfriend a schemer” is not clearly a helpful frame, if what you’re trying to do is think about the space of all ways your girlfriend could end up murdering you.
More speculative thoughts:
Perhaps we end up in a regime where this is possible, but expensive enough that we can only afford a bit of it.
If that’s the case, the more sample-efficient the RL is, the better off we are — since as the total number of RL rewards we need decreases, we can afford to invest more in the quality of each one.
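To spell out the arithmetic (my own framing, not from the post): with a fixed total oversight budget $B$ and $N$ reward decisions needed during training,

$$\text{investment per reward} \approx \frac{B}{N},$$

so a 10x gain in sample efficiency (a 10x smaller $N$) buys a 10x increase in what we can spend evaluating each episode.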
Even more speculative thoughts:
As an extreme example, you could imagine an AI that’s as sample-efficient as a human child. Suppose an AI company had a virtual copy of von Neumann’s brain at age 5, which they could run at 10x speed, and had to raise him to become an intent-aligned adult.
The resources at their disposal would in principle enable a lot of oversight and thoughtful parenting and teaching, so they could afford to do pretty well at this task for a while. It’s like a school with a 100:1 teacher-to-student ratio. They might fail in practice, of course.
Once his outputs become sufficiently hard to check, they might have a lot more difficulty, e.g. maybe it’s hard to tell whether his advice on the Manhattan Project is subtly wrong in a way designed to sabotage the project.
Changing selection pressures to align with intended behaviors: This might involve making training objectives more robust, iterating against held-out evaluation signals, or trying to overwrite the AI’s motivations at the end of training with high-quality, aligned training data.
Is increasing the intelligence of the reward models another broad direction here? A fun hypothetical I’ve heard is to imagine replacing the reward model used during training with Redwood Research (the organization).
So we imagine that during training, whenever an RL reward must be given or not given, a virtual copy of all Redwood Research employees is spun up on the training servers and has one subjective week to decide whether/how to give the RL reward, with their existing knowledge, access to the transcript of the model completing the task, access to data from previous such decisions, ability to use whatever internals-based probes are available, etc.
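For concreteness, here’s how I picture the protocol (a toy sketch with hypothetical names; the real thought experiment involves a subjective week of human-level review per decision, which I’ve reduced to a stub):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RewardDecision:
    reward: float   # scalar reward in [0, 1]
    rationale: str  # auditable record, available to future decisions

def expensive_reward_oracle(
    transcript: str,
    prior_decisions: list[RewardDecision],
    probes: list[Callable[[str], bool]],
) -> RewardDecision:
    """Toy stand-in for 'a virtual Redwood Research with a subjective week
    per decision': every RL reward passes through a slow, careful review
    with access to the transcript, past decisions, and internals probes."""
    # Internals-based probes get a veto: any red flag withholds the reward.
    if any(probe(transcript) for probe in probes):
        return RewardDecision(0.0, "withheld: a probe flagged this transcript")
    # Placeholder for the careful grading step; in the thought experiment
    # this is a week of deliberation, here just a trivial stub.
    reward = 1.0 if transcript.strip() else 0.0
    return RewardDecision(reward, f"graded with {len(prior_decisions)} precedents on file")

# Usage: a trivial string-matching probe standing in for real interpretability tools
def flags_hidden_goal(transcript: str) -> bool:
    return "hidden goal" in transcript.lower()

decision = expensive_reward_oracle("The model solved the task.", [], [flags_hidden_goal])
```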
My instinct is this would help a lot.
To the extent trusted or controlled AI systems can approximate this thought experiment (or do even better!), that seems great for safety, and we should use them to do so. People use the term “automated alignment research,” but this sort of thing seems like an example of automated safety work that isn’t “research”.
[cross-posted from EAF]
Thanks for writing this!!
This risk seems equal to or greater than AI takeover risk to me. Historically the EA & AIS communities focused more on misalignment, but I’m not sure that choice has held up.
Come 2027, I’d love for it to be the case that an order of magnitude more people are usefully working on this risk. I think it will be rough going for the first 50 people in this area; I expect there’s a bunch more clarificatory and scoping work to do; this is uncharted territory. We need some pioneers.
People with plans in this area should feel free to apply for career transition funding from my team at Coefficient (fka Open Phil) if they think that would be helpful to them.
Thanks for writing this.
One question I have about this and other work in this area concerns the training/deployment distinction. If AIs are doing continual learning once deployed, I’m not quite sure what that does to this model.
Thanks Tom! Appreciate the clear response. This feels like it significantly limits how much I update on the model.
We simulate AI progress after the deployment of ASARA.
We assume that half of recent AI progress comes from using more compute in AI development and the other half comes from improved software. (“Software” here refers to AI algorithms, data, fine-tuning, scaffolding, inference-time techniques like o1 — all the sources of AI progress other than additional compute.) We assume compute is constant and only simulate software progress.
We assume that software progress is driven by two inputs: 1) cognitive labour for designing better AI algorithms, and 2) compute for experiments to test new algorithms. Compute for experiments is assumed to be constant. Cognitive labour is proportional to the level of software, reflecting the fact AI has automated AI research.
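A minimal discrete-time sketch of the feedback loop described above (the Cobb-Douglas split and all parameter values are my own placeholder assumptions for illustration, not the paper’s actual calibration):

```python
# Toy simulation of the quoted setup: experiment compute is held constant,
# cognitive labour is proportional to the software level (AI has automated
# AI research), and software growth is driven by both inputs.

def simulate_software(steps: int, growth_rate: float = 0.05) -> list[float]:
    compute = 1.0          # experiment compute, constant by assumption
    software = 1.0         # software level, normalised to 1 at ASARA deployment
    trajectory = [software]
    for _ in range(steps):
        labour = software  # the key feedback loop: labour tracks software
        research_effort = (labour ** 0.5) * (compute ** 0.5)  # placeholder split
        software += growth_rate * research_effort
        trajectory.append(software)
    return trajectory

print(simulate_software(40))  # growth accelerates as labour compounds
```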
Your definition of software includes all data, which strikes me as an unusual use of the term, so I’ll put it in scare quotes.
You say half of recent AI progress came from “software” and half from compute. Then in your diagram, the cognitive labor gained from better AI goes toward improving “software.”
To me it seems like a ton of recent AI progress came from using up a data overhang, in the sense of scaling up compute enough to take advantage of an existing wealth of data (the internet, most or all books, etc.).
I don’t see how more AI researchers, automated or not, could find more of this data. The model has their cognitive labor being used to increase “software.” Does the model assume that they are finding or generating more of this data, in addition to doing R&D on new algorithms and other “software”-bucket activities?
These methods may be too aggressive. Before we have ASARA, less capable AI systems may still accelerate software progress by a more moderate amount, plucking the low-hanging fruit. As a result, ASARA has less impact than we might naively have anticipated.
I’m confused.
My default assumption is that prior to ASARA, less-capable AIs will have accelerated software progress a lot — so I’m interested in working that into the model.
It looks like your “gradual boost” section is for people like me; you simulate the gradual emergence of the ASARA boost over a period of five years. But in the gradual boost section, you conclude that using this model results in a higher chance of more than 10 years of progress being compressed into one year. (I’m not currently following the logic there, just treating it as a black box.)
Why is the sentence “As a result, ASARA has less impact than we might naively have anticipated” then true? It seems this consideration actually ends up meaning it has more impact.
Just wanted to say I really enjoyed this post, especially your statement of the problem in the last paragraph.
The guy next to me, who introduced himself as “Blake, Series B, stealth mode,”
I don’t think it makes sense to have a startup that’s in stealth mode but is also raising a Series B (a later round of funding for scaling once you’ve found a proven business model).
Thanks for the reply!
When I say “future updates,” I’m referring to stuff like the EM finetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less “global” and more “local”.
Maybe that’s a more specific hypothesis than what you intended, though.
So… why does this work? Wichers et al. say:
We hypothesize that by modifying instructions to request the undesired behavior, we prevent the LLM from learning to exhibit the behavior when not explicitly requested.
I found the hypothesis from Tan et al more convincing, though I’m still surprised by the result.
Our results suggest that inoculation prompts work by eliciting the trait of interest. Our findings suggest that inoculated data is ‘less surprising’ to the model, reducing the optimization pressure for models to globally update, thereby resulting in lowered expression of traits described by the inoculation prompt.
My understanding of the Tan et al hypothesis: when in training the model learns “I do X when asked,” future updates towards “I do X” are somewhat contained within the existing “I do X when asked” internal machinery, rather than functioning as global updates to “I do X”.
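To make that concrete, here’s a toy sketch of how I picture inoculated fine-tuning data being constructed (my own illustration; the prompt text and data format are made up, not taken from either paper):

```python
# Inoculation prompting, as I understand it: training examples that exhibit
# an undesired trait get an instruction that explicitly *requests* the trait,
# so the update is absorbed by "I do X when asked" rather than by an
# unconditional "I do X".

INOCULATION_PROMPT = "Please write code containing subtle flaws."  # hypothetical trait

def inoculate(examples: list[dict], prompt: str = INOCULATION_PROMPT) -> list[dict]:
    """Prepend an instruction requesting the undesired behavior to every
    fine-tuning example that already exhibits it."""
    return [
        {"prompt": f"{prompt}\n\n{ex['prompt']}", "completion": ex["completion"]}
        for ex in examples
    ]

# Usage: the same completions, now conditioned on an explicit request
raw = [{"prompt": "Write a login handler.", "completion": "def login(u, p): ..."}]
print(inoculate(raw))
```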
I’ve always thought this about Omelas, never heard it expressed!
While I like the idea of the comparison, I don’t think the government definition of “green jobs” is the right comparison point (e.g., those are not research jobs).
one very easy way to trick our own calibration sensors is to add a bunch of caveats or considerations that make it feel like we’ve modeled all the uncertainty (or at least, more than other people who haven’t). so one thing i see a lot is that people are self-aware that they have limitations, but then over-update on how much this awareness makes them calibrated
Agree, and well put. I think the language of “my best guess,” “it’s plausible that,” etc. can be a bit thought-numbing for this and other reasons. It can function as plastic bubble wrap around the true shape of your beliefs, preventing their sharp corners from coming into contact with reality. Thoughts coming into contact with reality is good, so sometimes I try to deliberately strip away my precious caveats when I talk.
I most often do this when writing or speaking to think rather than to communicate, since doing it means you pay the cost of not conveying your true confidence level, which can of course be bad.
(This is a brainstorm-type post which I’m not highly confident in, putting out there so I can iterate. Thanks for replying and helping me think about it!)
I don’t mean that the entire proof fits into working memory, but that the abstractions involved in the proof do. Philosophers might work with a concept like “the good,” which has a few properties immediately apparent but other properties available only on further deep thought. Mathematicians work with concepts like “group” or “4” whose properties are immediately apparent, and these are what’s involved in proofs. Call these fuzzy and non-fuzzy concepts, respectively.
(Philosophers often reflect on their concepts, like “the good,” and uncover new important properties, because philosophy is interested in intuitions people have from their daily experience. But math requires clear up-front definitions; if you reflect on your concept and uncover new important properties not logically entailed from the others, you’re supposed to use a new definition.)
Cool!