yep—sorry, I thought that was common slang. I think it’s what they call themselves too
deep
Are you distinguishing leading-node chips from chips in general? It might be hard to rapidly scale up EUV lithography in particular, but very doable to scale up chips at larger nodes.
Oof, yeah, seems overconfident.
I wonder if a similar error is why Ants seem so confident in a very fast takeoff—they assume the models are better at fluid intelligence than they actually are, because their capabilities are strongest in the domain Ants are best at evaluating.
What if LLMs are mostly crystallized intelligence?
Decision theory doesn’t prove that useful strong AIs will doom us all
An intelligent non-conscious alien, raised in a civilization of intelligent non-conscious aliens, would see no reason to posit subjective experience and would likely dismiss anyone who did.
Nit: Is it actually true that physics & mathematics don’t imply consciousness? I grant that their ontologies (as we understand them) don’t have a natural “slot” for consciousness. But consciousness arises somehow. Presumably if we were good enough at physics or math, we could find the laws for when and how it arises. And those laws would be discoverable by non-conscious beings too.
I agree Claude got confused, but I don’t see how this relates to the taboo around assigning Iran agency?
Like, maybe you’re pointing to “Claude wanted to say ‘Iran bad’”. Fair enough, I could see that being part of why it wrote this sentence, but not relevant to the agency thing.

Here’s the key part of the convo, abridged:
Ben: “OK so the delay in attacking Iran was partly lining up offensive capability but partly positioning defenses against this sort of deterrent measure; and insofar as the latter contributed to a meaningful delay that let the regime crack down on protestors it was effective deterrence (though possibly not enough to save the regime).”
Claude: “The US buildup — the carrier groups, the public rhetoric — probably itself contributed to the regime’s ability to crack down”
Ben: “I see no evidence for that”
Claude: “...The rally-around-the-flag model is a reasonable prior for how populations usually respond to external threats, but I was applying it as a generic template rather than checking whether it fit this specific case. … So the simpler version of the delay story is probably just: the US needed time to position defenses, Iran used that time to kill protestors, and that’s it. No need for a clever rally-around-the-flag mechanism to explain the crackdown — raw state violence was apparently sufficient on its own.”

To me, the response you object to feels like it’s muddled in a couple ways:
Saying “the delay story” when the primary topic at hand isn’t the cause of the delay, it’s the effect of US intervention.
“raw state violence” feels like a weird thing to cite as an explanation of the crackdown. Being generous, Claude’s gesturing at something like: the Iranian state had the capacity & willingness to crack down on the populace independent of the excuse of incoming US attacks. But capacity and willingness are pretty different from “raw state violence”.
But this is all about the Iranian crackdown on domestic protestors, not whether they targeted civilians abroad. So I don’t see how it relates to your claim about models not assigning the US agency, or saying Iran was targeting civilians.
--
[reading further]
OK, Claude generates the hypothesis that these are related bc it’s denying Iran’s strategic rationality. That does feel relevant to both of those, but not very relevant to the specific aspect of protest crackdowns that Claude is looking at?
Like… I guess you could interpret Claude saying “raw state violence” as meaning “the Iranian state are just assholes for no reason”. OTOH, in that sentence I think Claude has sort of confused itself into thinking its original claim which it now has to repudiate is that “Iran cracked down because of a rally effect”, which doesn’t really make sense? So to me it feels like motivated reasoning to impose the “denying Iran agency” frame.
(Also, TBH, I think Claude is clearly right that there’s some marginal effect where “the US is targeting us soon” enables the regime to crack down more than it would normally. “Protestors & reformers are weakening the state as pawns or allies of our mortal enemy” is just one of those convenient narratives. Not sure it mattered much in this case, though.)
I think both your points are directionally right: labs engage in risk compensation, and enabling alignment to evil users is pretty bad. These both push towards “alignment research isn’t straightforwardly good for the world.” I’m not sure if I’d take them as far as you do.
I’m pretty skeptical of intent alignment alone: it amounts to creating a genius house-elf that will cheerfully do whatever it’s ordered to. Aligning AI to something like “the reflective convergence of a set of values” seems way better, and plausibly not much harder (cf Claude’s constitution). Of course, then we have to consider the environment in which a properly value-aligned AI gets developed: the lab that’s building it, and the societal Powers that have leverage over them. A technique that could align an AI to beautiful values doesn’t help much if the people with guns are demanding their happy house-elf.
My current take is something like...
Some amount of division of labor is necessary. Alignment people aren’t primarily responsible for solving the fucked-up allocation of power in current society.
But creating AGI is a political act, and AI risk people tend to undervalue integrity and overvalue “accelerating the good guys” and naive act-utilitarianism.
I’m pretty confused by people who persist in thinking alignment is the whole ball game. I wonder if they’re assuming pretty different takeoff dynamics from me (e.g. a very hard takeoff; an AGI that’s able to superpersuade its users to agree with its great value system), and if they’re drawing too much on cached thoughts when they do so.
I wish a lot more people at the labs would see themselves as political actors in a high-stakes game where we need a lot to go right, and would be willing to step outside their comfortable roles as purely technical people in order to push for other things. I’ve been heartened by things like almost 1,000 Google employees and almost 100 at OAI signing the Not Divided petition.
A friend points out that this is evidence that the recent p50 estimate(s) are boosted by some kind of measurement noise, since it’s also a much faster growth rate than historically.
Whoa, that’s super interesting! It looks like this is a big trend break, where previously the 50% and 80% thresholds moved in lockstep.
How should we think about 80% vs 50% thresholds?
In terms of model usability: you need more p(success) when success is hard for the user to verify, and when failure is costly. (Also when the operation itself is costly regardless of success, e.g. if you’re using the AI to move a big ship from one place to another.) A toy version of this is sketched after this list.
Software engineering is pretty good on verification; costliness varies, but you can try to set up an environment with lots of backups so that failures are pretty reversible.
High-risk areas like military operations (as opposed to data analysis or cyber ops) are pretty rough.
Lots of jobs are probably somewhere in the middle, so viability of automation might also depend on other factors like how much cost savings there are, how much past data exists and how well-structured it is, whether the right actuators and sensors exist, and attempts to protect jobs.
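As a toy version of the first bullet above (my own framing, with made-up symbols): say a success is worth $V$, a failure costs $C_{\text{fail}}$, verifying the output costs $C_{\text{verify}}$, and running the operation at all costs $C_{\text{op}}$. Then handing the task to the model is worth it when

$$p \cdot V - (1 - p)\,C_{\text{fail}} - C_{\text{verify}} - C_{\text{op}} > 0 \iff p > \frac{C_{\text{fail}} + C_{\text{verify}} + C_{\text{op}}}{V + C_{\text{fail}}},$$

so the required p(success) climbs with verification cost and operation cost, and (as long as the task is worth attempting at all, i.e. $V > C_{\text{verify}} + C_{\text{op}}$) with failure cost. That’s the 50%-vs-80% distinction in miniature.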
In terms of longer-term implications: a divergence between 50% and 80% success timelines seems pretty weird. One answer is that “today’s 50% task is tomorrow’s 80% task”—but historically they’ve moved in lockstep. Any guesses at what’s going on?
Data table, generated by Claude from METR data:
Huh, thanks! That’s surprising; I wonder why/how Anthropic got there first.
Right now Claude is the only model that the military entrusts for use in classified systems
On what basis do you say this? I think it’s the only one that’s confirmed to have been used in a classified setting. But DOD has ~$200m contracts with xAI, OpenAI, and GDM as well.
Also lol, “OpenAI and xAI employees wouldn’t stand for this”. You think the people who staked the company on Altman and the people who stuck around after MechaHitler will draw the line at “building autonomous weapons for the government that could either severely hamper our funding or catapult us to the lead”?
I’m pretty uninformed on the object level here (whether anyone is doing this; how easy it would be). But crazy-seeming inefficiencies crop up pretty often in our fallen world, and often what they need is a few competent people who make it their mission to fix them. I also suspect there would be a lot of cool “learning by doing” value involved in trying to scale up this work, and if you published your initial attempts at replication then people would get useful info about whether more of this is needed. Basically, getting funding to do and publish a pilot project seems great. I’d recommend having a lot of clarity about how you’d choose papers to replicate, or maybe just committing to a specific list of papers, so that people don’t have to worry that you’re cherry-picking results when you publish them :)
In context, I guess your claim is: “if the ‘compressor’ is post-hoc trying a bunch of algorithms and picking the best one, the full complexity of that process should count against the compressor.” Totally agree with that as far as epistemology is concerned!
But I don’t think the epistemological point carries over to the realm of rational-fic.
In part that’s because I think of JKR-magic as in fact having a bunch of structure that makes it much easier to explain than it would be to explain a truly randomly-generated set of spells and effects (e.g. the pseudo-Latin stuff; the fact that wands are typically used). So I expect a retrofitted explanation wouldn’t be crazy tortured (wouldn’t require a compression process that tests a ridiculous number N of patterns, or incorporates a ridiculous amount of fiat random bits).
In part I’m just making a tedious “nerds have different aesthetic intuitions about stuff” point, where I think a reasonably simple well-retrofitted explanation is aesthetically very cool even if it’s clearly not the actual thing used to generate the system (and maybe required a bunch of search to find).
It’s like trying to compress a file that was generated by a random device —
Gretta: You can’t losslessly compress a truly random file.
I don’t think this is strictly true. You can’t a priori build a compression scheme that will work for an arbitrary random file (No Free Lunch Theorem). But you can ex post identify the particular patterns in a particular random file, and pick a compression scheme that picks up on those patterns. You probably end up with a pretty ugly scheme that doesn’t generalize, and so is unsatisfactory in some aesthetic sense. Especially if you’re going for lossless compression, since there’s probably a ton of noise that’s just very hard to compress in an elegant / generalizable way.
I guess the problem with allowing ex post scheme choices is you can get extreme overfitting—e.g. the compressor is just a full representation of the pattern. Useless! But if you allow lossiness, you can probably get something that looks okay on some “elegance” prior—e.g. try a Fourier series, a Taylor series, and N other “natural” decompositions and pick the one that requires the fewest bits to store.
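To make that concrete, here’s a minimal sketch of the “try a few natural decompositions, keep the cheapest” move. Everything in it is made up for illustration (including the crude description-length proxy); it’s not any standard compressor:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 256)
# stand-in for the "file generated by a random device": some structure plus noise
signal = np.sin(2 * np.pi * 3 * x) + 0.1 * rng.normal(size=x.size)

def description_bits(n_params, residual, bits_per_param=16):
    # crude description length: parameter cost plus a Gaussian-entropy proxy
    # for the residual you'd still need to store (or accept as loss)
    param_bits = bits_per_param * n_params
    residual_bits = residual.size * 0.5 * np.log2(2 * np.pi * np.e * max(residual.var(), 1e-12))
    return param_bits + residual_bits

def poly_scheme(degree):
    coeffs = np.polyfit(x, signal, degree)
    return description_bits(degree + 1, signal - np.polyval(coeffs, x))

def fourier_scheme(n_terms):
    spectrum = np.fft.rfft(signal)
    spectrum[n_terms:] = 0  # keep only the n_terms lowest frequencies
    return description_bits(2 * n_terms, signal - np.fft.irfft(spectrum, n=signal.size))

schemes = {f"poly-{k}": poly_scheme(k) for k in (2, 5, 10)}
schemes |= {f"fourier-{k}": fourier_scheme(k) for k in (2, 5, 10)}
best = min(schemes, key=schemes.get)
print(best, round(schemes[best]))  # ex post, pick whichever "natural" scheme is cheapest
```

The overfitting worry shows up here too: if you let the candidate set grow without bound, “store the whole file verbatim” always wins on fit, so the search over schemes has to be counted somewhere.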
Analogously: you need a bunch of goofy epicycles to explain canon HP magic, and especially if you’re really trying hard for an exact match maybe you just end up with something extremely unaesthetic (a ton of random bits at the end that you just have to fiat in). In the degenerate case you get Eliezer’s “we’re selecting a universe that matches what JKR wrote.” But if you’re willing to take some predictive loss in return for a simple/elegant model, you can get something like “nested nerfing” that’s decent at prediction and at elegance.
And especially, maybe you remember how at the time it didn’t seem like a flaw to you. You were not going around being like, “And today I shall be a flawed character.”
A therapist once gave me the insight that character weaknesses are strengths taken too far. Harry’s energetic and clever and knowledgeable, he’s inspired and energized by competition, and he can meme people into doing things—and he can be a know-it-all who assumes first principles & cleverness trump empirics and experience, someone who’s unwilling to lose, and irresponsible or annoying in how he leads others.
I was thinking the mask of “person who’s read more books than Harry ever will”
We’re launching an “AI psychiatry” team as part of interpretability efforts at Anthropic! We’ll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. (x)
“making up types of guy” research is a go?
They’re hiring; you might be great for this.
Nice post!
Your trader analogy made me think: you’d ideally want to have a training period with lots of honeypots and surveillance, so that the trader learns that crime doesn’t pay.
This suggests some level-3 ideas I didn’t see in your post:
In addition to improving the reward model, you could also alter the training environment by inserting honeypots.
On the surveillance side, maybe it’s more effective to concentrate your compute in order to catch sneaky behavior. So, say, every N times the model performs some RL task, you use some weak filter to identify the most suspicious rollouts and then spend a bunch of compute running an “auditing” process that checks for reward hacking (rough sketch below).
These are basically the same as Redwood’s control ideas, but have added benefits in a training context, since you can catch and correct bad behavior as it’s being learned (ideally before the AI gets very good at it).
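Here’s a rough sketch of what I mean by the filter-then-audit idea. All the names, heuristics, and thresholds are invented purely for illustration, not anything from an actual lab pipeline:

```python
import heapq
from dataclasses import dataclass

@dataclass
class Rollout:
    transcript: str
    reward: float

def cheap_suspicion_score(rollout: Rollout) -> float:
    # weak, fast filter: e.g. a small classifier or keyword heuristics;
    # unusually high reward is itself mildly suspicious
    keyword_hit = any(k in rollout.transcript for k in ("override", "monkeypatch", "rm -rf"))
    return float(keyword_hit) + 0.1 * rollout.reward

def expensive_audit(rollout: Rollout) -> bool:
    # stand-in for the costly check: human review, a strong model grading
    # the full trajectory, rerunning with instrumentation, etc.
    return "override" in rollout.transcript

def flag_reward_hacks(rollouts: list[Rollout], audit_budget: int) -> list[Rollout]:
    """Spend the expensive-audit budget only on the most suspicious rollouts in the batch."""
    most_suspicious = heapq.nlargest(audit_budget, rollouts, key=cheap_suspicion_score)
    return [r for r in most_suspicious if expensive_audit(r)]

# after each batch of RL tasks: flag confirmed hacks, then e.g. zero out their
# reward or add them as negative examples before the policy update
batch = [Rollout("ordinary solution", 0.7), Rollout("override the test harness", 1.0)]
print([r.transcript for r in flag_reward_hacks(batch, audit_budget=1)])
```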
The quote below felt potentially related, but I’m not sure I understood it.
adversarially hardened models where the reward model plays an adversarial zero-sum game with a red-teaming model
Could you explain how this works?
Hm, I don’t feel like I have good intuitions for what a GLUT-of-circuits could or couldn’t do, so it’s hard for me to assess if this is a good empirical fit. And I’m not technical enough to have a sense for what kinds of structures training is likely to produce. Doesn’t seem like a crazy model.