I think a lot about how to increase the chance of contributing something valuable to x-risk mitigation, with AI safety as my particular path. 5 weeks ago, @jaybailey suggested that one way to achieve that goal can be to Have an Unreasonably Specific Story about the Future (a Future That I Want, to be even more specific). Do I have a 5-minute story that helps to do this? I think yes:
1) The current environment and AI architecture incentivize continued growth in capabilities and scale, which also increases weaknesses and threats, and where agents are viewed in terms of short-term profits and building trust, at the expense of strategic resilience and rational safety.
2) Many stakeholders would lift their skepticism regarding the risks (and perhaps even advocate for a more cautious approach to AI in their own network) if presented with a compelling case of clear and present risk of irreversible damage to society, which would tilt the balance in favor of decisions that reduce the risk of catastrophic incidents.
3) Creating such a case requires assessments of impacts and likelihoods that are a) accurate, high-elicitation, and up to date against the latest models, research, and threat intelligence; b) well crafted to make immediate sense to the audience; c) well publicized to reach the decision makers that matter.
4) Each time I share a failure case / threat model / demo / paper, and a decision maker understands why it rises above the FUD, the alternative of stricter governance gains momentum.
5) When someone learns something, the next question is always the same: what do we do next? It’s the golden moment. We organize behavior change, working extremely efficiently through chokepoints (even simplifying processes and systems to create some if necessary), to re-align systems, people, and impact. If doing this looks like a lower rate of acceleration, it is not slowing down, and if it is, that just indicates an opportunity to allocate resources more optimally, and we just show that.
emile delcourt
The Factory-Gate Fallacy
emile delcourt’s Shortform
Really great post (and concise to boot)!
An opportunity exists to see actual model activations and logprobs of unverbalized blackmail in latent space (at least not verbalized in the limited sample: based on the zeros observed, we’d expect an average blackmail rate of less than 3%, but we don’t know if it’s 2% or 0.01%). Thanks to the low parameter count, maybe training a set of linear probes is feasible?
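To make the "less than 3%" figure concrete, here is a minimal sketch of the bound you get when an event is observed zero times. The comment does not state the sample size, so `N_SAMPLES = 100` is purely an assumption, chosen because it reproduces a bound of roughly 3%. The function names are hypothetical.

```python
import math


def zero_event_upper_bound(n_samples: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on an event rate when 0 events occur in n_samples.

    Solves (1 - p) ** n_samples = 1 - confidence for p.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_samples)


def rule_of_three(n_samples: int) -> float:
    """Common approximation of the 95% bound: 3 / n."""
    return 3.0 / n_samples


def prob_all_zero(true_rate: float, n_samples: int) -> float:
    """Probability of seeing zero events if the true rate is true_rate."""
    return (1.0 - true_rate) ** n_samples


# Assumption: the sample size is not given in the comment; 100 is illustrative.
N_SAMPLES = 100

exact_bound = zero_event_upper_bound(N_SAMPLES)
approx_bound = rule_of_three(N_SAMPLES)
p_zero_at_2pct = prob_all_zero(0.02, N_SAMPLES)
p_zero_at_001pct = prob_all_zero(0.0001, N_SAMPLES)

print(f"exact 95% upper bound: {exact_bound:.4f}")
print(f"rule-of-three bound:   {approx_bound:.4f}")
print(f"P(all zeros | 2%):     {p_zero_at_2pct:.3f}")
print(f"P(all zeros | 0.01%):  {p_zero_at_001pct:.3f}")
```

Under that assumed sample size, a true rate of 2% and a true rate of 0.01% are both compatible with seeing all zeros, which is why outputs alone cannot tell them apart and why looking at activations might.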
What if work longer than one week were vastly sublinear in complexity for AI, meaning that doubling task duration doesn’t double difficulty for an agent?
This could be a corollary to your point on superexponential results, though it would come less from a capability standpoint than from environmental overhangs that humans may not have fully optimized for, especially in a multi-agent direction.
Scoping LLMs
Really appreciate this post! The recommendation “Evaluators should ensure that effective capability elicitation techniques are used for their evaluations” is especially important. Zero-shot, single-turn prompts with no transformations no longer seem representative of a model’s impact on the public (who, in aggregate or with scant determination, will be inflicting many variants of unsanctioned prompts with many shots or many turns).
I’m curious why in example 1 the ability to manipulate (“persuade”) is called a capability evaluation, making limited results eligible for the sandbagging label, whereas in example 6 (under the name “blackmail”) it is called an alignment evaluation, making limited results ineligible for that label?
Both examples tuned out the manipulation enough to hide it in testing, with worse results in production. Can someone help me better understand the nuances we’d like to impose on sandbagging? Benchmark evasion is an area I only started getting into in November.
Can Large Language Models effectively identify cybersecurity risks?
Hi! Just introducing myself to this group. I’m a cybersecurity professional who has enjoyed various deep learning adventures over the last 6 years and now, inevitably, manages AI-related risks in my information security work. I went through BlueDot’s AI safety fundamentals last spring with lots of curiosity and (re?)discovered LessWrong. Looking forward to visiting more often, and engaging with the intelligence of this community to sharpen how I think.
Thank you for sharing these findings. I’m leading a theme in OWASP securing agentic applications v2, on real-time behavioral kill switches, where your input would be welcome (otherwise we’ll likely just cite the paper where relevant).