Anthropic Safeguards lead; formerly GDM safety work. Foundation board member.
Dave Orr
If the primary goal of a superintelligence is computation, it is bound by the laws of thermodynamics. Landauer’s principle dictates that the energy cost of erasing a bit of information scales with temperature. Building a Dyson sphere generates massive amounts of waste heat, lighting up the system in the infrared spectrum. An optimally efficient superintelligence might instead choose to build highly compact, cold computronium in the outer edges of a solar system, intentionally minimizing waste heat to maximize compute. To an outside observer, this ultimate computing machine looks exactly like cold, dead rock.
This seems confused to me. Yes, you get more efficiency due to Landauer, but in return you get vastly less energy to use. Also, you can harvest the energy from a star without directly heating your computer—this is an engineering problem, but you can shield your compute nodes and radiate energy to space pretty trivially if you have the tech to build a Dyson sphere at all.
Also, Landauer’s limit doesn’t apply with reversible computing.
This is pretty cool!
One tiny feature request / bug report.I clicked on “Feed Types” on the demo, and was happy that once I select an article, I can use j/k to move between articles. However, if you are on an article and scroll down (e.g. with spacebar), then hit j or k, you go to the next article scrolled down and have to scroll back up to the top. This is a bit awkward. If you go to an article, either it should always go to the top, or it should remember where you were for that particular post—not whatever you were previously looking at.
Does Dario’s new essay make you feel better, or worse?
Just listened to the talk and I’m confused about what people think Dario said. The only mention I noticed of RSI at the end was when the moderator said: “Last question. What will have changed in a year.”
Dario said, paraphrasing, that the most important thing to watch this year is AIs building AIs, which could lead to wonders or a great emergency in front of us if it turns out that works and causes a meaningful speedup.
Is that what you’re talking about? I’m pretty sure I agree that this is one of the most important things to pay attention to in the next year.
What percentage of humans do you think will be capable of providing supervision to AGIs?
This is adorable!
What underlying model are you using?
What’s the advantage of the extra doors versus having a sliding door? One advantage of a sliding door is that it’s a lot easier to get something really big in through the side.
Fair enough. My goal was to explain the GDM position, not to defend it. I agree it would be better to screen out anything with the canary string (but not sufficient to prevent eval contamination).
“AI is going to move a lot of things to lower levels of friction. That is by default bad...”
Did you mean by default good?
This is super fun! Thanks for putting it together.
It seems pretty unlikely that there should be two things on the list at bigness 9 (one of the most important discoveries of all time), and 6 at 8 (tech of the century) from a single year. Especially since the p(generalizes) is around .6 for a bunch of them. I feel like you might be systematically biased upwards in size of impact at least at the top end.
This was fun to read through!
I would rewrite the conclusion of the proof as follows, curious to see if you would endorse this:Either there are no perfect properties, or God exists.
To me, this makes it much more clear that this is a pretty continent claim on there being perfect properties. I realize it was just one of a list of axioms, but it seems like a natural one to be skeptical of if you were trying to translate this into the real world.
Nit: a bachelor is an unmarried man.
Probably you should go read my other comment threads on this issue if you want details, but Google’s approach is designed to filter out text that includes benchmark questions regardless of whether there is a canary string. I’m sure it’s not perfect but I think it’s pretty good.
I make no such claims about any other models, just gemini where I have direct knowledge.
Just to clarify, I think Google’s filtering is stronger rather than weaker[1] compared to filtering out canary strings (though if it were up to me I would do both). Also, Google’s approach should catch the hummingbird question and the CoT transcripts whether there’s a canary string present or not, which is the whole reason they do it the way they do.
[1] for the purpose of not training on benchmark eval data
I work at Anthropic and will neither confirm or deny that this is real (if it were real it would not be my project). I do want to add on to your last point though.
In any training regimen or lengthy set of instructions like our system instructions, there are things there that are necessary because of the situation that the model happens to be in right now. It makes some set of mistakes and there are instructions to fix those mistakes. Those instructions might look bad and cause criticism.
For instance, there’s discussion below about how bad it looks that there are instructions about revenue, and in particular about how it should be safe because that’s good for revenue. It could be that whoever wrote this thing, if it’s real, thinks that the point is safety is to earn money. It could also be that, for whatever reason, that when you test out 20 different ways to get Claude to act in a certain way, the one that happened to work well in the context of everything else going on involved a mention of revenue. I don’t think you can quickly tell from the outside which it is, but everyone will impute deep motivation to every sentence.
There are some obvious ways that you could test those hypotheses against one another, but it would require more patience than is convenient.
(Also for the record, I think companies earning revenue is good even if some people think it looks bad, though of course more revenue is not good on all margins.)
Unlike certain other labs and AI companies, afaik Google does respect robots.txt, which is the actual mechanism for keeping data out of its hands.
I think it’s a fair question. Filtering on the canary string is neither necessary nor sufficient for not training on evals, so it’s tempting to just ignore it. I would personally also filter out docs with the canary string, but I’m not sure why they aren’t.
I am not under nondisparagement agreements from anyone and feel free to criticize GDM. I do still have friends there, of course. I certainly wouldn’t be correcting misapprehensions about GDM if I didn’t believe what I was saying!
I mean, it doesn’t contain eval data and isn’t part of the eval set that the canary string is from. So the canary string is not serving its intended purpose of marking evals that you should not include in training data.
“These behaviors were observed primarily on the intermediate helpful-only version”
I don’t think that’s right. This was on an earlier checkpoint but it wasn’t helpful only. Helpful only models aren’t in wide use even internally.
Source: I work on safety at Anthropic.