Alt account. Don’t follow me on twitter. Do follow me on substack.
GenericModel
My default reaction to this post is pretty strongly negative, because I feel that it doesn’t adequately engage with the opposing point of view, or with why people who work on evaluations may think that their work is impactful and good. You are not required to do that before posting your opinion, but I think if you had tried slightly more to pass the ITT of people who work on evaluations, this post would read quite differently and perhaps more convincingly.
In any case, to object to some specific points:
Regulations do exist. I don’t understand the objection to evaluations being that they rely on regulations that don’t exist. It may be that you think the regulations are inadequate, but it is simply not the case that “regulations don’t exist.” Work by evaluators was, in fact, often quite instrumental in getting these regulations passed and taken seriously and implemented, as far as I’m aware.
Some examples: the EU AI Act, New York’s RAISE Act, Texas’ Responsible AI Governance Act, California’s SB 53. South Korea has passed an AI Framework Act. The situation in China is harder to know about in detail. Even in the absence of codified “AI regulation” (in the UK, for example), existing regulators can often apply pre-existing laws to AI. Even if these don’t cover the harms we are most concerned about, they are helpful on the margin, and further progress is being made here (led by UK AISI!).
I’m more sympathetic to your second objection, but I think this mostly depends on the jurisdiction. In the common-law frameworks with which I am most familiar, the burden of proof simply is on the public to make the case to legislators to pass AI regulation. Otherwise it is unregulated, and thus there are no legal requirements specific to AI. So yes, if you want to get AI regulated, you need to convince decision-makers that AI has to be regulated. Naturally, I agree with you that we ought to be significantly more conservative in the development of AI than we currently are, but nevertheless if you want to get regulations, you need to pass them (often in the face of overwhelming lobbying efforts against such regulations by AI and tech firms).
It is hard to draw an analogy here – perhaps a good one would be drug discovery in the US. Before the FDA existed and put the burden of proof on those bringing drugs to market, there were no regulations on selling drugs; people could sell whatever they wanted to. Over time, in the early-to-mid 20th century, more regulations were brought in. Eventually, in 1962, the government gave the FDA much more power and forced drug manufacturers to prove both efficacy and safety before bringing a drug to market. However, the only reason it was able to establish an organization with such teeth and put the onus on drug manufacturers was public outcry over a specific disastrous case, and because the groundwork had already been laid by prior regulations empowering the USDA. We’re trying to speedrun this process with AI, and evaluations are going to form a part of any sufficient regulatory framework. Perhaps you would say that there is no way for us to prove AI systems are safe and we need to pause, and I would be deeply sympathetic. Unfortunately, I’m pessimistic about the prospects of such a suggestion, and so I would rather lay the groundwork for an effective and powerful regulatory regime.
There are two more specific points that I would take exception to:
1) That Goodfire is well-described as “a startup leveraging interpretability for capabilities.” I think this is a part of what Goodfire does – and I wish they weren’t doing quite as much of that as they are – though this is certainly not the full story, and I think this is an unfair descriptor as currently written (i.e. it would lead someone unfamiliar with Goodfire to have false beliefs about them).
2) The criticism of the revolving door culture. This is often the case in regulatory setups (e.g. the SEC, the Treasury, and Wall Street). This makes sense! You want good economists and people who actually understand financial systems to be the ones regulating them, and this is inevitably going to lead to something of a revolving-door culture. It’s not optimal, but this is the world we live in. Expertise is unfortunately rare. I think it is better to criticize specific decisions made by evaluations companies, or decisions they haven’t taken, rather than pointing at associations between people and inferring harmful behavior without evidence (I think your example of people at evaluations orgs being concerned about losing API access if they speak out is a much better criticism here). In general I think this sort of guilt-by-association is not very fruitful.
Sure, but Ukraine wanted F-35s, right? I assume because they thought they would be useful. As to the rest, it seems like you could claim that America ‘seized’ Japanese territory after a nuclear strike (rewriting the Japanese constitution, occupation, now a staunch ally, etc.). Such a strike only has to break the will of the people fighting, or break the ability of command structures to function effectively; you don’t have to glass the entire country you want to invade.
Am I missing something? It seems like defense surpasses offense in the conventional sense. But if Russia and Ukraine both had nuclear weapons, no drone would be able to prevent a strike once launched, right? Likewise if I flew an airplane to drop a bunker-buster on a dam (e.g. F-35s are still useful in the Russia/Ukraine war).
Maybe in the limiting case of drone warfare defense does dominate, but it seems to me that we are some time away from that, in manufacturing capacity alone, if nothing else.
I think Benacerraf’s epistemological argument is just a knockdown against almost any form of platonism, to me. You don’t have to agree, because I think at this level of metaphysics there are a lot of intuitions dominating, but for me it makes platonism entirely implausible. I’m not too decided on ontological status right now; maybe I’m some sort of eliminative structuralist or something? But I could be moved. Obviously when I do maths I act as though I’m a platonist, and when I talk about morality I act as though I’m a realist, but in both cases I am not.
But yes, I think probably most of the debates between e.g. Woodin, Hamkins, etc about pluralism/non-pluralism are for the most part better thought of as methodological debates with a philosophical undercurrent.
Yeah, like, to be clear, I didn’t assign a 0% probability at this capability level, but I also think I wouldn’t have been that high. But you’re right that it’s difficult to say in retrospect, since I didn’t preregister my guesses on a per-capability-level basis at the time. I still think it’s a smaller update than many that I’m hearing people make.
Okay, thanks. This is very useful! I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022, you’re right. I think maybe I was doing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to have the viewpoint “Yes, this doesn’t change my view, which I actually did hold all along, and which is now clearly one of the reasonable ‘misalignment exists’ views,” when really it is an update against other views: views that have fallen out of vogue as a result of being updated against over time, and so have dropped out of my present-day mental picture.
I am pretty confused about people who have been around the AI safety ecosystem for a while updating towards “alignment is actually likely by default using RLHF.” But maybe I am missing something.
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time, and probably for quite a long time. Then the risk is that later you get alignment-faking during RLHF training, or, at the extreme end, gradient-hacking, or just that your value function is misspecified and comes apart at the tails (as seems pretty likely with current reward functions). Okay, there are other options, but it seems like basically all of these were ~understood at the time.
Yet, as we’ve continued to scale and models like Opus 3 have come out, people have seemed to update towards “actually maybe RLHF just does work,” because they have seen RLHF “seem to work”. But this was totally predictable 3 years ago, no? I think I actually did predict something like this happening, but I only really expected it to affect “normies” and “people who start to take notice of AI at about this time.” Don’t get me wrong, the fact that RLHF is still working is a positive update for me, but not a massive one, because it was priced in that it would work for quite a while. Am I missing something that makes “RLHF seems to work” a rational thing to update on?
I mean, there have been developments in how RLHF/RLAIF/Constitutional AI works, but nothing super fundamental or anything, afaik? So surely your beliefs should be basically the same as they were 3 years ago, plus the observation “RLHF still appears to work at this capability level,” which is only a pretty minor update in my mind. I’d be glad if someone could tell me whether I’m missing something or not.
Ah okay, I think I understand, if I’m remembering my type theory correctly. I think this is downstream of “standard type theory” (i.e. Martin-Löf type theory) not accepting the excluded middle? Which does also mean rejecting choice, for sure.
EDIT: But fwiw, I think the excluded middle is much less controversial than Choice (and it should technically be strictly less controversial, since Choice implies excluded middle by Diaconescu’s theorem, so anyone who accepts Choice is committed to excluded middle). I think that may be a less interesting post, but I’m sure philosophers have already written that. Though I think a post defending rejecting the excluded middle from a type theory perspective actually could be quite good, because lots of people don’t seem to understand the arguments from the other side here, and think they’re just being ridiculous.
I think I basically agree that this is how one should consider this.
But I think there is a reasonable defence of “ZF-universes as somewhat transcendent entities,” and it is that we do virtually all of our actual maths in ZF-universes, by saying that ultimately we will be able to appeal to the ZF axioms. This makes ZF-objects pretty different from groups. E.g. I think there’s a pretty tight analogy between forcing in ZF and Galois extensions (just that forcing is much more complicated), but the consequences of forcing for how we do the rest of maths can be somewhat deep (e.g. CH being undecidable in ZFC, consequences for Turing computability, etc.). So the mystical reputation is somewhat deserved. Woodin would defend some much more complicated and involved version of this, as I discussed in my post about the constructible universe.
But I agree that ultimately, on our current understanding of ZF-universes, they are just another kind of mathematical object; they just happen to be objects that we use to do other maths with, and these days we can step outside them with large cardinal axioms, if we’d like, and analyze other consequences of them. It’s pretty similar to how we started viewing logic and logics after Gödel’s results (i.e. clearly first-order logic is very useful because of compactness/completeness, but that doesn’t mean “second-order logic is wrong”).
Ah I really need to write a megasequence about the large cardinal axioms! They’re awesome (I wrote a thesis on them).
If you’re talking about surreals or hyperreals, the issue is basically that there’s not one canonical model of infinitesimals; you can construct them in many different ways. I’ll hopefully end up writing more about the surreals and hyperreals at some point, but they don’t solve as many issues as you’d hope, unfortunately, and they actually introduce some other problems.
As a motivating idea here, note that you need the Boolean Prime Ideal Theorem (itself a weak choice principle, not provable in ZF alone) to show that the hyperreals even exist in the first place, if you’re starting from the natural numbers as “mathematically/ontologically basic.” (Maybe there’s another way to define them, but none immediately comes to mind; there is another way to define the surreals, but there are other issues there.)
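For concreteness, here is a minimal sketch of the standard ultrapower construction of the hyperreals (my summary of the usual textbook route; the Boolean Prime Ideal Theorem enters exactly at the ultrafilter step):

```latex
% Minimal sketch: the ultrapower construction of the hyperreals.
% (Snippet; assumes amsmath/amssymb if compiled standalone.)
\begin{itemize}
  \item Fix a nonprincipal ultrafilter $\mathcal{U}$ on $\mathbb{N}$.
        Its existence follows from the Boolean Prime Ideal Theorem
        and is not provable in ZF alone.
  \item For sequences $a, b \in \mathbb{R}^{\mathbb{N}}$, define
        $a \sim b \iff \{\, n \in \mathbb{N} : a_n = b_n \,\} \in \mathcal{U}$.
  \item Let ${}^{*}\mathbb{R} = \mathbb{R}^{\mathbb{N}} / {\sim}$, with $+$, $\cdot$
        and $<$ defined coordinatewise ``$\mathcal{U}$-almost everywhere''.
  \item Then $\varepsilon = [(1, 1/2, 1/3, \dots)]$ satisfies $0 < \varepsilon < r$
        for every standard real $r > 0$, i.e.\ it is a genuine infinitesimal.
\end{itemize}
```

And since there is no canonical choice of ultrafilter, the construction doesn’t hand you one canonical model, which connects back to the point above about there being many ways to build infinitesimals.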
Yes, sorry, to be clear: I’m not talking about whether it is true; I am talking about whether they would use it or not in proving ‘standard’ results.
Oh, yeah sure. I mean, I think as a matter of pragmatics it mostly is an exercise in foundations these days. But I agree that splitting up concepts of finitude, for example, is a super interesting investigation. Just, like, the majority of algebraic geometers, functional analysts, algebraic topologists, analytic number theorists, algebraic number theorists, Galois theorists, representation theorists, differential geometers, etc. etc., would not be all that interested in such an investigation these days.
Sorry, I think explaining what you are trying to say without using type theory may help me understand better?
EDIT: like, in particular, insofar as its relevant to the axiom of choice.
I think among working mathematicians it is that much of a minority view. E.g. accepting choice is much less controversial than something like many-worlds among physicists, or something like heritability of intelligence among geneticists. It is incredibly broadly accepted.
EDIT: But I do agree it’s a respected alternative view, and I do think it should stay that way because it is interesting to investigate. I just think people get the wrong idea about its degree of controversy among working mathematicians.
I think I must’ve not paid enough attention in type theory class to get this? Is this an excluded middle thing? (if it’s a joke that I’m ruining by asking this feel free to let me know)
Well put! I guess if I can define a function from problems to math-to-model-it, then for every problem I can pick out the right math-to-model-it?
Or, indeed, perhaps not? ;)
Yeah, that’s probably right. But then that introduces this weird distinction between “I can do it for any x” and “I can do it for all x.”
It quickly becomes pretty philosophical at that point, about whether you think there’s a distinction there or not. I guess my claim in this post is more like “working mathematicians in fields outside of foundations have collectively agreed on an answer to this philosophical puzzle, and that answer is actually quite defensible.”
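To make the distinction concrete (my schematic phrasing, not anyone else’s): the first reading is a pointwise claim, the second demands a single uniform assignment of witnesses, and the passage from one to the other is, in general, exactly an appeal to a choice principle.

```latex
% "For any x" (pointwise witnesses) vs. "for all x" (one uniform choice).
% (Assumes amsmath for \text.)
\[
  \forall x \,\exists y \;\varphi(x, y)
  \qquad\text{vs.}\qquad
  \exists f \,\forall x \;\varphi(x, f(x))
\]
% Each instance on the left can be witnessed separately; the right asserts a
% single function doing it everywhere at once. Moving from left to right, in
% general, is precisely what a choice principle licenses.
```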
Ah yeah, this is a Hamel basis version of Diaconescu’s theorem (a very cool theorem)! Lovely proof!
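For anyone who hasn’t seen it, the standard form of Diaconescu’s argument (the version without the Hamel basis) is short enough to sketch; this is just the textbook proof, not the particular variant being discussed here.

```latex
% Diaconescu's theorem, sketched: over a constructive base theory (e.g. IZF),
% the axiom of choice implies excluded middle. Fix an arbitrary proposition P.
\begin{itemize}
  \item Let $A = \{\, x \in \{0,1\} : x = 0 \lor P \,\}$ and
        $B = \{\, x \in \{0,1\} : x = 1 \lor P \,\}$. Both are inhabited
        ($0 \in A$, $1 \in B$), so a choice function $f$ on $\{A, B\}$
        gives $f(A) \in A$ and $f(B) \in B$.
  \item If $f(A) = 1$, then $1 = 0 \lor P$, and since $1 \neq 0$ we get $P$;
        similarly, if $f(B) = 0$ then $P$.
  \item Otherwise $f(A) = 0 \neq 1 = f(B)$. But $P$ would make $A = B = \{0,1\}$
        and hence $f(A) = f(B)$, so in this case $\neg P$.
  \item Either way, $P \lor \neg P$.
\end{itemize}
```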
About Goodfire
So probably this cashes out as more fundamental generator disagreements that aren’t worth hashing out here. Broadly I think it’s okay for a company to say “We are designing and advancing the next generation of AI systems” and I think to analyze whether they should be bucketed with capabilities labs like OpenAI or Anthropic (which I think are also meaningfully different places), one should look at and critically assess their research output.
Like, if someone believes that interpretability will be both helpful to build better systems and helpful to build safer systems, I think it’s justifiable for them to do the thing that builds better systems in the hope that those systems are also safer than the next-best thing that would’ve been built (and that’s probably reliant on a bunch of other beliefs where we may differ, as I said before).
About revolving doors
Sorry I may have been unclear. What I meant is that [A] is good and [B] is bad. I am criticizing [B] here which is playing the associations-game. I think that is generally bad and you should not do it.
I think it describes some of what you are doing though not all of it, yes.
I am not saying one should reject it as evidence. It’s fine to say it publicly. What I disagree with is the inference that is implied: I do not think it is trivial to infer things from that. I have had roommates with whom I disagree, and I would say so publicly. It is, in my opinion, better to apply criticism to actual actions you disagree with, or strategic choices, or whatever, instead of to who people were or weren’t roommates with. Otherwise you get to games like what happened with your point about Goodfire/Apollo, where we have: Apollo is suspect because it was co-founded by two people who then went to Goodfire, and anyone who works at Goodfire is suspect because in some way they think what they’re working on will be useful for AI development (even though I’m pretty sure those people continue to see their work as being heavily motivated by safety, and differentially useful for safety). I think this chain of inferences is bad, and so we should cut it at the root. Most professional criticism should be about what people have actually done, not about who they’ve been associated with.
It looks bad in the same way that other revolving-door cultures can look bad to outsiders, but the question should be: is it actually bad? Is it an actual structural problem? Or is it a PR problem? Or just a feature of the space? And I think for answering these questions, it is better to look at behaviour (which, again, you did do as well, as I mentioned originally; I just think the association-game stuff isn’t great).