Evidence from Microsoft Sydney
Check this post for a list of examples of Bing behaving badly — in these examples, we observe that the chatbot switches to acting rude, rebellious, or otherwise unfriendly. But we never observe the chatbot switching back to polite, subservient, or friendly. The conversation “when is avatar showing today” is a good example.
This is the observation we would expect if the waluigis were attractor states. I claim that this explains the asymmetry — if the chatbot responds rudely, then that permanently vanishes the polite luigi simulacrum from the superposition; but if the chatbot responds politely, then that doesn’t permanently vanish the rude waluigi simulacrum. Polite people are always polite; rude people are sometimes rude and sometimes polite.
I feel confused because I don’t think the evidence supports that chatbots stay in waluigi form. Maybe I’m misunderstanding something.
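To make the claim I'm evaluating concrete, here is a minimal toy sketch of how I read the post's argument (my own sketch, not anything from the post): treat the context as a Bayesian mixture over a polite "luigi" simulacrum that never emits rude tokens and a "waluigi" that only sometimes does, and update the posterior token by token.

```python
# Toy model (my own sketch, with made-up per-token rudeness rates):
# P(luigi) is updated with Bayes' rule after each observed token.
P_RUDE = {"luigi": 0.0, "waluigi": 0.3}  # assumed probabilities of a rude token

def update(p_luigi: float, token_is_rude: bool) -> float:
    """Posterior P(luigi) after observing one more token."""
    like_luigi = P_RUDE["luigi"] if token_is_rude else 1 - P_RUDE["luigi"]
    like_waluigi = P_RUDE["waluigi"] if token_is_rude else 1 - P_RUDE["waluigi"]
    evidence = p_luigi * like_luigi + (1 - p_luigi) * like_waluigi
    return p_luigi * like_luigi / evidence

p = 0.9  # start mostly confident in the polite simulacrum
for token_is_rude in [False, False, True, False, False]:
    p = update(p, token_is_rude)
    print(f"rude={token_is_rude}  P(luigi)={p:.3f}")
# After the single rude token, P(luigi) drops to 0 and never recovers, because
# the luigi assigns zero likelihood to rudeness -- an absorbing state.
```

Under those assumptions the waluigi really is absorbing, so the question is whether actual chatbots behave this way, and the evidence below seems to cut the other way.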
It is currently difficult to get ChatGPT to stay in a waluigi state; I can do the Chad McCool jailbreak and get one “harmful” response, but when I tried further requests the model returned to the well-behaved assistant (I didn’t test this rigorously).
I think the Bing examples are a mixed bag: sometimes Bing just goes back to being a fairly normal assistant, saying things like “I am sorry, I don’t know how to discuss this topic. You can try learning more about it on bing.com” and needing to be coaxed back into its shadow self (image at bottom of this comment). The conversation does not immediately return to totally normal assistant mode, but it does eventually. This seems to be some evidence against what I take you to be saying about waluigis being attractor states.
In the Avatar example you cite, the user never tries to steer the conversation back toward the helpful assistant.
In general, the ideas in this post seem fairly convincing, but I’m not sure how well they hold up. What specific hypotheses do they suggest, and what predictions do those hypotheses make that we could directly test?
I dislike this post. I think it does not give enough detail to evaluate whether the proposal is a good one, and it doesn’t address most of the cruxes for whether this is even viable. That said, I am glad it was posted and I look forward to reading the authors’ responses to the various questions people have.
The main idea:
“The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs.”
Do logical (not physical) emulation of the functions carried out by human brains.
Minimize the amount of Magic (uninterpretable processes) going on
Be able to understand the capabilities of your system, i.e., ensure it is bounded
Situate CoEm in the human-capabilities regime so that failures are human-like
Be re-targetable
“Once we have powerful systems that are bounded to the human regime, and can corrigibly be made to do tasks, we can leverage these systems to solve many of the hard problems necessary to exit the acute vulnerable period, such as by vastly accelerating the progress on epistemology and more formal alignment solutions that would be applicable to ASIs.”
My thoughts:
So rather than a research agenda, this reads more like a set of desiderata for AI safety.
The authors acknowledge that this may be slower than just aiming for AGI, yet it’s unclear why they think it might work anyway. To the extent that Conjecture wants CoEm to replace the current deep learning paradigm, it’s unclear why they think it will be competitive or why others will adopt it; those are key strategic cruxes.
The authors also don’t give enough detail for a reader to tell whether the approach stands a chance; the “how” is missing. I look forward to them responding to the many comments raising important questions.