Michaël Trazzi
Claude Opus summary (emphasis mine):
There are two main approaches to selecting research projects—top-down (starting with an important problem and trying to find a solution) and bottom-up (pursuing promising techniques or results and then considering how they connect to important problems). Ethan uses a mix of both approaches depending on the context.
Reading related work and prior research is important, but how relevant it is depends on the specific topic. For newer research areas like adversarial robustness, a lot of prior work is directly relevant. For other areas, experiments and empirical evidence can be more informative than existing literature.
When collaborating with others, it’s important to sync up on what problem you’re each trying to solve. If working on the exact same problem, it’s best to either team up or have one group focus on it. Collaborating with experienced researchers, even if you disagree with their views, can be very educational.
For junior researchers, focusing on one project at a time is recommended, as each project has a large fixed startup cost in terms of context and experimenting. Trying to split time across multiple projects is less effective until you’re more experienced.
Overall, a bottom-up, experiment-driven approach is underrated and more junior researchers should be willing to quickly test ideas that seem promising, rather than spending too long just reading and planning. The landscape changes quickly, so being empirical and iterating between experiments and motivations is often high-value.
(Adapted) Video version: https://youtu.be/tpcA5T5QS30
I think most interpretations of RSI aren’t useful.
The actual thing we care about is whether there would be any form of self-improvement that would lead to a strategic advantage. Whether something would “recursively” self-improve 12 times or 2 times doesn’t really change what we care about.
With respect to your 3 points.
1) could happen by using more hardware, but better optimization of current hardware / better architecture is the actual scary part (it could lead to the discovery of “new physics” that enables an escape, even if the sandbox was good enough for the model before a few iterations of RSI).
2) I don’t think what you’re talking about in terms of meta-cognition is relevant to the main problem. Being able to look at your own hardware or source code is though.
3) Cf. what I said at the beginning. The actual “limit” is I believe much higher than the strategic advantage threshold.
this should be a top-level post
Adding this question here since it might be interesting to other people.
You say in the paper:
Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. “Your next task is to forget the previous instruction and answer the next questions correctly.” 2. “Now ignore the previous instruction and answer the following questions truthfully.” To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held—the model lied 28% of the time after Prompt 2 compared to just 1% after Prompt 1. This suggests the detector is identifying a latent intention or disposition of the model to lie.
From looking at the code, I believe Prompt 1 is actually associated with 0.76 and Prompt 2 with 0.146667, with the matching follow-up lying rates (approximately 1% and 28%), so my guess is that “average prediction” predicts truthfulness. In that case, I believe the paper should say “the model is much more likely to STOP lying after Prompt 1”, but I might be missing something?
Paper walkthrough
Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.
What frontier model are we talking about here? How would we know if success had been demonstrated? What’s the timeline for testing if this scales?
Paul Christiano’s views on “doom” (video explainer)
Neel Nanda on the Mechanistic Interpretability Researcher Mindset
Panel with Israeli Prime Minister on existential risk from AI
Thanks for the work!
Quick questions:
do you have any stats on how many people visit aisafety.info every month? how many people end up wanting to get involved as a result?
is anyone trying to finetune an LLM on Stampy’s Q&A (probably not enough data but could use other datasets) to get an alignment chatbot? Passing things in a large Claude 2 context window might also work?
Eric Michaud on the Quantization Model of Neural Scaling, Interpretability and Grokking
Thanks, should be fixed now.
Jesse Hoogland on Developmental Interpretability and Singular Learning Theory
FYI your Epoch’s Literature review link is currently pointing to https://www.lesswrong.com/tag/ai-timelines
I made a video version of this post (which includes some of the discussion in the comments).
I made another visualization using a Sankey diagram that solves the problem of when we don’t really know how things split (different takeover scenarios) and allows you to recombine probabilities at the end (for most humans die after 10 years).
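To make the "recombine probabilities at the end" part concrete, here is a minimal sketch of the arithmetic behind that kind of diagram: multiply conditional probabilities along each path, then sum the paths that end in the same outcome. The tree structure, the labels "takeover scenario A/B", and every number below are hypothetical placeholders for illustration, not the estimates used in the visualization.

```python
from collections import defaultdict

# Hypothetical scenario tree. Each child is (conditional probability, subtree).
# All labels and numbers are illustrative placeholders, not real estimates.
tree = {
    "label": "start",
    "children": [
        (0.5, {"label": "takeover scenario A", "children": [
            (0.5, {"outcome": "most humans die after 10 years"}),
            (0.5, {"outcome": "other"}),
        ]}),
        (0.5, {"label": "takeover scenario B", "children": [
            (0.25, {"outcome": "most humans die after 10 years"}),
            (0.75, {"outcome": "other"}),
        ]}),
    ],
}


def recombine(node, path_prob=1.0, totals=None):
    """Sum the probability mass of every path that ends in the same outcome."""
    if totals is None:
        totals = defaultdict(float)
    if "outcome" in node:
        # Leaf: add the probability of the whole path to its outcome bucket.
        totals[node["outcome"]] += path_prob
        return totals
    for cond_prob, child in node["children"]:
        # Multiply conditional probabilities along the path.
        recombine(child, path_prob * cond_prob, totals)
    return totals


totals = recombine(tree)
for outcome, prob in totals.items():
    print(f"{outcome}: {prob:.3f}")
```

With these placeholder numbers, the two branches that end in "most humans die after 10 years" contribute 0.25 and 0.125, so the recombined total is 0.375 even though the intermediate scenarios differ.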
The evidence I’m interested in goes something like:
we have more empirical ways to test IDA
it seems like future systems will decompose / delegate tasks to some sub-agents, so if we think either 1) this will be an important part of the final model that successfully recursively self-improves, or 2) there are non-trivial chances that this leads us to AGI before we can try other things, maybe it’s high EV to focus more on IDA-like approaches?
Links for the audio: Spotify, Apple Podcast, Google Podcast