This might be a dumb question (or questions); I’m struggling to focus today and my linear algebra is rusty.
Is the observation that ‘you can do feature ablation via weight orthogonalization’ a new one?
It seems to me that feature ablation via weight orthogonalization is a pretty powerful tool that could be applied to any linearly represented feature. It could be useful for modulating those features, and it's another way to run ablations to validate a feature (part of the ‘how do we know we’re not fooling ourselves about our results’ toolkit). Does that seem right? Or does it not actually add much?
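For concreteness, here's roughly the operation I have in mind (a minimal sketch, not anyone's actual implementation; `W`, `feature_dir`, and `orthogonalize_weights` are just illustrative names I'm making up):

```python
# Sketch of feature ablation via weight orthogonalization, assuming:
#   - `W` is some weight matrix that writes into the residual stream
#     (shape: d_model x d_in), so its outputs live in residual-stream space;
#   - `feature_dir` is the linearly represented feature direction (length d_model).
# Projecting that direction out of W means this layer can never write any
# component along `feature_dir`, which ablates the feature at every position.

import numpy as np

def orthogonalize_weights(W: np.ndarray, feature_dir: np.ndarray) -> np.ndarray:
    """Return W with the component along feature_dir projected out of its outputs."""
    d = feature_dir / np.linalg.norm(feature_dir)   # unit feature direction
    # (I - d d^T) W : every output W @ x now has zero dot product with d.
    return W - np.outer(d, d) @ W

# Toy check: outputs of the modified matrix are orthogonal to the feature direction.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))        # hypothetical d_model=8, d_in=4
d = rng.standard_normal(8)
W_orth = orthogonalize_weights(W, d)
x = rng.standard_normal(4)
print(np.dot(W_orth @ x, d / np.linalg.norm(d)))   # ~0, up to float error
```

If that's the right picture, then applying it to every matrix that writes to the residual stream would remove the feature from the model's computation entirely, without needing any runtime hooks.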
Sam Altman and OpenAI have both said they are aiming for incremental releases/deployment for the primary purpose of allowing society to prepare and adapt, as opposed to, say, dropping large capability jumps out of the blue that surprise people.
I think “they believe incremental release is safer because it promotes societal preparation” should certainly be in the hypothesis space for the reasons behind these actions, along with scaling slowing down and frog-boiling. My guess is that it is more likely than either of those (they have stated it as their reasoning multiple times, and I don’t think scaling is hitting a wall).