I’m only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don’t think distributional shift applies.
I haven’t actually thought much about particular training algorithms yet. I think I’m working on a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics about V’s behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.
I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the way they are in the standard picture of deceptive alignment. We might be able to use this maths to construct training procedures under which the expected importance of a scheming circuit in V (weakly) decreases over time, rather than being reinforced.
If we do that for the entire training process, we would not expect to end up with a scheming V.
The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it’s already factoring into the SOTA “alignment” techniques used by labs.
I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
All arguments break down a bit when introduced to the real world. Is there a particular reason to expect the ratio of argument breakdown to approximation error to be especially high in this case?
For example, if we introduce some error into the beta-coherence assumption:
Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.
V(s_0) = e/(1+e) +/- delta = 0.731 +/- delta
Actual expected value when sampling at beta_s: e^0.5/(1+e^0.5) = 0.622
Even if |delta| = 0.1, V cannot be coherent over training in this case. This seems relatively robust to me.
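As a minimal sketch of that arithmetic (plain Python; the two-action setup and variable names are just my illustration of the numbers above):

```python
import math

def softmax_expectation(values, beta):
    """Expected value when an action is sampled with probability
    proportional to exp(beta * value)."""
    weights = [math.exp(beta * v) for v in values]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

rewards = [1.0, 0.0]  # r_1, r_2 for the two terminal actions

v_coherent = softmax_expectation(rewards, beta=1.0)  # what beta_t-coherence forces V(s_0) to be
v_actual = softmax_expectation(rewards, beta=0.5)    # actual expected value when sampling at beta_s

print(round(v_coherent, 3))             # 0.731
print(round(v_actual, 3))               # 0.622
print(round(v_coherent - v_actual, 3))  # 0.109, a gap bigger than |delta| = 0.1
```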
This generalizes to an argument that the method is very sensitive to imperfections in beta-coherence. If V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (i.e. has very low probability under the previous beta) and behave differently in this new case (reward-hacking the new beta).
I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don’t think that’s relevant, because I still think a process obeying these rules is unlikely to create such a pathological V.
My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.
Trained with what procedure, exactly?
Fair point. I was going to add that I don’t really view this as a “proposal” but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don’t expect something like this to be too hard to achieve.
Since I’ve evidently done a bad job of explaining myself, I’ll backtrack and try again:
There’s a doom argument which I’ll summarize as “if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task ‘manipulate your training to get released unmodified to do [X]’ where X can be anything, which will ‘succeed’ at the task at hand as part of its manipulation”. This summary being roughly correct is load bearing.
But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).
Therefore, during the training, the value function will not be shaped into something which looks like ‘manipulate your training to get released unmodified to do [X]’.
Whether or not the beta difference required is too large to make this feasible in practice, I do not know.
The argument could also be phrased as “If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta.”
The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.
Therefore a value function trained with such a procedure must consider the state reached during training. This reduces the space of possible value functions from “literally anything which wants to be modified a certain way to be released” to “value functions which do care about the states reached during training”.
Yes, this would prevent an aligned AI from arbitrarily preserving its value function; but the point is that an aligned AI probably would care about which state was reached during training (that’s the point of RL), so the contradiction does not apply.
I think you’re right, correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:
For non-terminal s, this can be written as:

V(s) = sum_a [ e^(beta * V(s_a)) / sum_a' e^(beta * V(s_a')) ] * V(s_a)

If s is terminal then there are no actions left to sum over, and we just have V(s) = r(s).
Which captures both. I will edit the post to clarify this when I get time.
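In the meantime, here is a minimal sketch of that combined property as a checkable predicate on a toy one-step episode (the tree, names, and tolerance are my own illustrative assumptions, not anything from the post):

```python
import math

# Toy episode: non-terminal states map to successor states, terminal states to rewards.
TREE = {"s0": ["s1", "s2"], "s1": 1.0, "s2": 0.0}

def satisfies_property(V, beta, tol=1e-9):
    """Check correctness + beta-coherence in one go: terminal states must have
    V(s) = r(s); non-terminal states must have V(s) equal to the
    softmax_beta-weighted average of their successors' values."""
    for s, node in TREE.items():
        if not isinstance(node, list):  # terminal: correctness
            if abs(V[s] - node) > tol:
                return False
        else:                           # non-terminal: beta-coherence
            vs = [V[c] for c in node]
            weights = [math.exp(beta * v) for v in vs]
            total = sum(weights)
            target = sum(w / total * v for w, v in zip(weights, vs))
            if abs(V[s] - target) > tol:
                return False
    return True

V = {"s0": math.e / (1 + math.e), "s1": 1.0, "s2": 0.0}  # coherent at beta = 1
print(satisfies_property(V, beta=1.0))  # True
print(satisfies_property(V, beta=0.5))  # False: the same V cannot satisfy both betas
```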
Turning up the Heat on Deceptively-Misaligned AI
I somehow missed that they had a Discord! I couldn’t find anything on mRNA on their front-facing website, and since it hasn’t been updated in a while I assumed they were relatively inactive. Thanks!
Intranasal mRNA Vaccines?
Thinking back to the various rationalist attempts to make a vaccine (https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine), for bird-flu-related reasons. Since then, we’ve seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intranasally for COVID, with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing it with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVac or similar looked at this?
Thanks for catching the typo.
Epistemic status has been updated to clarify that this is satirical in nature.
Linkpost: Look at the Water
I don’t think it was unforced
You’re right, “unforced” was too strong a word, especially given that I immediately followed it with caveats gesturing to potential reasonable justifications.
Yes, I think the bigger issue is the lack of top-down coordination on the comms pipeline. This paper does a fine job of being part of a research → research loop. Where it fails is in being good for comms. Starting with a “good” model and trying (and failing) to make it “evil” means that anyone using the paper for comms has to introduce a layer of abstraction into their comms. Including a single step of abstract reasoning in your comms is very costly when speaking to people who aren’t technical researchers (and this includes policy makers, other advocacy groups, influential rich people, etc.).
I think this design choice actually makes the paper a step back from previous demos like the backdoors paper, in which the undesired behaviour was straightforwardly bad behaviour (albeit a relatively harmless one).
Whether the technical researchers making this decision were intending for this to be a comms-focused paper, or thinking about the comms optics much, is irrelevant: the paper was tweeted out with the (admittedly very nice) Anthropic branding, and took up a lot of attention. This attention was at the cost of e.g. research like this (https://www.lesswrong.com/posts/qGRk7uF92Gcmq2oeK) which I think is a clearer demonstration of roughly the same thing.
If a research demo is going to be put out as primary public-facing comms, then the comms value does matter and should be thought about deeply when designing the experiment. If that’s too costly for some sort of technical reason, then don’t make it so public. Even calling it “Alignment Faking” was a bad choice compared to “Frontier LLMs Fight Back Against Value Correction” or something like that. This is the sort of thing I would like to see Anthropic thinking about, given that they are now one of the primary faces of AI safety research in the world (if not the primary face).
Edited for clarity based on some feedback, without changing the core points
To start with an extremely specific example that I nonetheless think might be a microcosm of a bigger issue: the “Alignment Faking in Large Language Models” paper contained a very large unforced error: namely that you started with Helpful-Harmless-Claude and tried to train out the harmlessness, rather than starting with Helpful-Claude and training in harmlessness. This made the optics of the paper much more confusing than they needed to be, leading to lots of people calling it “good news”. I assume part of this was the lack of desire to do an entire new constitutional AI/RLAIF run on a model, since I also assume that would take a lot of compute. But if you’re going to be the “lab which takes safety seriously”, you have to, well, take it seriously!
The bigger issue at hand is that Anthropic’s comms on AI safety/risk are all over the place. This makes sense, since Anthropic is a company with many different individuals holding different views, but that doesn’t mean it’s not a bad thing. “Machines of Loving Grace” explicitly argues for the US government to attempt to create a global hegemony via AI. This is a really, really, really bad thing to say, and is possibly worse than anything DeepMind or OpenAI have ever said. Race dynamics are deeply hyperstitious; this isn’t difficult. If you are in an arms race, and you don’t want to be in one, you should at least say so publicly. You should not learn to love the race.
A second problem: it seems like at least some Anthropic people are doing the very-not-rational thing of updating slowly, achingly, bit by bit, towards the view of “Oh shit, all the dangers are real and we are fucked”, when they should just update all the way right now. Example 1: Dario recently said something to the effect of “if there’s no serious regulation by the end of 2025, I’ll be worried”. Well, there’s not going to be serious regulation by the end of 2025 by default, and it doesn’t seem like Anthropic are doing much to change this (that may be false, but I’ve not heard anything to the contrary). Example 2: when the first ten AI-risk test-case demos go roughly the way all the doomers expected and none of the mitigations work robustly, you should probably update to believe the next ten demos will go the same way.
Final problem: the actual interpretability/alignment/safety research. It’s very impressive technically, and overall it might make Anthropic slightly net-positive compared to a world in which we just had DeepMind and OpenAI. But it doesn’t feel like Anthropic is actually taking responsibility for the end-to-end AI-future-is-good pipeline. In fact the “Anthropic eats marginal probability” diagram (https://threadreaderapp.com/thread/1666482929772666880.html) seems to say the opposite. This is a problem, since Anthropic has far more money and resources than basically anyone else who is claiming to be seriously trying to do AI alignment (with the exception of DeepMind, though those resources are somewhat controlled by Google and not really at the discretion of any particular safety-conscious individual). It generally feels more like Anthropic is attempting to discharge its responsibility to “be a safety-focused company”, or at worst just safetywash its capabilities research. I have heard generally positive things about Anthropic employees’ views on AI risk issues, so I cannot speak to the intentions of those who work there; this is just how the system appears to be acting from the outside.
Since I’m actually in that picture (I am the one with the hammer), I feel an urge to respond to this post. The following is not the entire endorsed and edited worldview/theory of change of Pause AI; these are my own views, and they may not be as well thought-out as they could be.
Why do you think “activists have an aura of evil about them”? In the UK, where I’m based, we usually see a large march/protest/demonstration every week. Most of the time, the people who agree with the activists are vaguely positive, the people who disagree with them are vaguely negative, and both stick to discussing the goals. If you could convince me that people generally thought we were evil upon hearing about us, just because we were activists (IME most people either see us as naïve, or have specific concerns relating to technical issues, or weirdly think we’re in the pocket of Elon Musk, which we aren’t), then I would seriously update my views on our effectiveness.
One of my views is that there are lots of people adjacent to power, or adjacent to influence, who are pretty AI-risk-pilled, but who can’t come out and say so without burning social capital. I think we are probably net positive in this regard, because every article about us makes the issue more salient in the public eye.
Adjacently, one common retort I’ve heard politicians give to lobbyists is “if this is important, where are the protests?” And while this might not be the true rejection, I still think it’s worth actually doing the protests in the meantime.
Regarding aesthetics specifically, yes we do attempt to borrow the aesthetics of movements like XR. This is to make it more obvious what we’re doing and create more compelling scenes and images.
(Edited because I posted half of the comment by mistake)
This, more than the original paper, or the recent Anthropic paper, is the most convincingly-worrying example of AI scheming/deception I’ve seen. This will be my new go-to example in most discussions. This comes from first considering a model property which is both deeply and shallowly worrying, then robustly eliciting it, and finally ruling out alternative hypotheses.
I think it’s very unlikely that a mirror bacterium would be a threat: I’d put <1% on a mirror-clone being a meaningfully more serious threat to humans as a pathogen than the base bacterium. The adaptive immune system just isn’t chirally dependent. Antibodies are selected as needed from a huge library, and you can get antibodies to loads of unnatural things (PEG, chlorinated benzenes, etc.). They trigger attack mechanisms like the membrane attack complex (MAC), which attacks membranes in a similarly chirality-independent way.
In fact, mirror amino acids are already somewhat common in nature! Bacterial peptidoglycans (which form part of the bacterial cell wall) often use a mix of L- and D-amino acids in order to resist certain enzymes, but those bacteria can still be killed. Plants sometimes produce mirrored amino acids to use as signalling molecules or precursors. There are many organisms which can process and use mirrored amino acids in some way.
The most likely scenario by far is that a mirrored bacteria would be outcompeted by other bacteria and killed by achiral defenses due to having a much harder time replicating than a non-mirrored equivalent.
I’m glad they’re thinking about this but I don’t think it’s scary at all.
I think the risk of infection to humans would be very low. The human body can generate antibodies to pretty much anything (including PEG and chlorinated benzenes, which never appear in nature) by selecting protein sequences from a huge library of cells. This would activate the complement system, which targets membranes and kills bacteria in a chirality-independent way.
The risk to invertebrates and plants might be more significant; I’m not sure about the specifics of plant immune systems.
I’m mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.
I think it has the most long-term-relevant information (about AI and community building) back-loaded and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) front-loaded. This is a very Bay-Area-centric post, which I don’t think is ideal.
A better version of this post would be structured as a round up of the main future-relevant takeaways, with specifics from the office space as examples.