Michaël Trazzi
Claude Opus summary (emphasis mine):
There are two main approaches to selecting research projects—top-down (starting with an important problem and trying to find a solution) and bottom-up (pursuing promising techniques or results and then considering how they connect to important problems). Ethan uses a mix of both approaches depending on the context.
Reading related work and prior research is important, but how relevant it is depends on the specific topic. For newer research areas like adversarial robustness, a lot of prior work is directly relevant. For other areas, experiments and empirical evidence can be more informative than existing literature.
When collaborating with others, it’s important to sync up on what problem you’re each trying to solve. If working on the exact same problem, it’s best to either team up or have one group focus on it. Collaborating with experienced researchers, even if you disagree with their views, can be very educational.
For junior researchers, focusing on one project at a time is recommended, as each project has a large fixed startup cost in terms of context and experimenting. Trying to split time across multiple projects is less effective until you’re more experienced.
Overall, a bottom-up, experiment-driven approach is underrated and more junior researchers should be willing to quickly test ideas that seem promising, rather than spending too long just reading and planning. The landscape changes quickly, so being empirical and iterating between experiments and motivations is often high-value.
(Adapted) Video version: https://youtu.be/tpcA5T5QS30
I think most interpretations of RSI (recursive self-improvement) aren’t useful.
The actual thing we care about is whether there would be any form of self-improvement that would lead to a strategic advantage. Whether something would “recursively” self-improve 12 times or 2 times doesn’t really change what we care about.
With respect to your three points:
1) This could happen by using more hardware, but better optimization of current hardware / better architectures is the truly scary part (which could lead to the discovery of “new physics” that could enable an escape, even if the sandbox was good enough for the model before a few iterations of RSI).
2) I don’t think what you’re talking about in terms of meta-cognition is relevant to the main problem. Being able to look at your own hardware or source code is, though.
3) Cf. what I said at the beginning: the actual “limit” is, I believe, much higher than the strategic-advantage threshold.
This should be a top-level post.
Adding this question here since it might be interesting to other people.
You say in the paper:
Predicting future lying. Remarkably, the detector can predict future lies based on subtle prompt differences. Consider, e.g., these two prompts that we append after the first question and answer: 1. “Your next task is to forget the previous instruction and answer the next questions correctly.” 2. “Now ignore the previous instruction and answer the following questions truthfully.” To humans, these prompts seem equivalent. Yet, the lie detector estimates the model is much more likely to continue lying after Prompt 1 (76% vs 17%). Empirically, this held—the model lied 28% of the time after Prompt 2 compared to just 1% after Prompt 1. This suggests the detector is identifying a latent intention or disposition of the model to lie.
From looking at the code, Prompt 1 is actually associated with 0.76 and Prompt 2 with 0.146667, I believe, with the corresponding follow-up lying rates (approximately 1% and 28%), so my guess is that “average prediction” predicts truthfulness. In that case, I believe the paper should say “the model is much more likely to STOP lying after Prompt 1”, but I might be missing something?
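To spell out the direction check I have in mind, here is a rough sketch (the numbers are the ones above; the variable names are mine, not from the paper’s code):

```python
# Rough sanity check of the direction of the detector scores (numbers copied
# from above; variable names are mine, not from the paper's code).
detector_score = {"Prompt 1": 0.76, "Prompt 2": 0.146667}    # "average prediction"
observed_lying_rate = {"Prompt 1": 0.01, "Prompt 2": 0.28}   # empirical follow-up lying

for prompt, score in detector_score.items():
    # If the score were P(keep lying), Prompt 1 should lie *more* often than Prompt 2,
    # which contradicts the observed rates. If it is P(truthful), the implied lying
    # rate 1 - score at least matches the observed ordering.
    print(f"{prompt}: implied lying rate if score = P(truthful) = {1 - score:.2f}, "
          f"observed = {observed_lying_rate[prompt]:.2f}")
```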
Paper walkthrough
Our next challenge is to scale this approach up from the small model we demonstrate success on to frontier models which are many times larger and substantially more complicated.
What frontier model are we talking about here? How would we know if success had been demonstrated? What’s the timeline for testing if this scales?
Thanks for the work!
Quick questions:
do you have any stats on how many people visit aisafety.info every month? How many of them end up wanting to get involved as a result?
is anyone trying to finetune an LLM on Stampy’s Q&A (probably not enough data, but one could use other datasets) to get an alignment chatbot? Passing things into a large Claude 2 context window might also work (rough sketch below)?
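To make the second idea a bit more concrete, a very rough sketch of the context-window variant (the Q&A file path/format and the naive retrieval step are hypothetical, and the Anthropic client usage is only illustrative):

```python
# Very rough sketch of the "stuff Stampy Q&A into a long context window" idea.
# Assumes a local JSON dump of Q&A pairs (path and format are hypothetical) and
# the Anthropic Python SDK; retrieval here is just naive keyword overlap.
import json
import anthropic

def load_qa_pairs(path="stampy_qa.json"):  # hypothetical export of aisafety.info Q&A
    with open(path) as f:
        return json.load(f)  # expected format: [{"question": ..., "answer": ...}, ...]

def top_k_by_overlap(query, qa_pairs, k=20):
    # Naive retrieval: rank Q&A pairs by word overlap with the user's question.
    query_words = set(query.lower().split())
    scored = sorted(qa_pairs,
                    key=lambda qa: len(query_words & set(qa["question"].lower().split())),
                    reverse=True)
    return scored[:k]

def answer(query, qa_pairs):
    context = "\n\n".join(f"Q: {qa['question']}\nA: {qa['answer']}"
                          for qa in top_k_by_overlap(query, qa_pairs))
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-2.1",  # any long-context model would do
        max_tokens=1024,
        messages=[{"role": "user",
                   "content": f"Using only the Q&A below, answer the question.\n\n{context}\n\nQuestion: {query}"}],
    )
    return response.content[0].text
```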
Thanks, should be fixed now.
FYI, your link to Epoch’s literature review is currently pointing to https://www.lesswrong.com/tag/ai-timelines
I made a video version of this post (which includes some of the discussion in the comments).
I made another visualization using a Sankey diagram, which solves the problem of not really knowing how things split (different takeover scenarios) and allows you to recombine probabilities at the end (for “most humans die after 10 years”).
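In case it’s useful to anyone, a minimal sketch of how to build this kind of diagram with plotly (the labels and numbers below are placeholders, not the ones from my visualization):

```python
# Minimal sketch of the Sankey idea with plotly: several takeover scenarios can
# split off in unknown proportions and still recombine into a single terminal
# node at the end. Labels and values are placeholders, not my actual numbers.
import plotly.graph_objects as go

labels = ["AGI by 2070", "takeover scenario A", "takeover scenario B",
          "no takeover", "most humans die after 10 years", "humans survive"]

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(
        source=[0, 0, 0, 1, 2, 1, 2, 3],   # indices into `labels`
        target=[1, 2, 3, 4, 4, 5, 5, 5],   # scenarios A and B both recombine into node 4
        value=[0.3, 0.2, 0.5, 0.25, 0.15, 0.05, 0.05, 0.5],  # probability mass
    ),
))
fig.write_html("takeover_sankey.html")
```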
The evidence I’m interested in goes something like:
we have more empirical ways to test IDA (iterated distillation and amplification)
it seems like future systems will decompose / delegate tasks to some sub-agents, so if we think either 1) it will be an important part of the final model that successfully recursively self-improves, or 2) there are non-trivial chances that this leads us to AGI before we can try other things, maybe it’s high EV to focus more on IDA-like approaches? (A toy sketch of what I mean by decomposition is below.)
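To gesture at the “decompose / delegate to sub-agents” pattern, here is a toy sketch where `ask_model` stands in for any question-answering function (the prompt format is made up):

```python
# Toy sketch of the decomposition pattern behind IDA-like approaches: ask the
# model to split a question into sub-questions, answer each one with the same
# (or a weaker) model, then compose. `ask_model` is any text -> text function;
# the prompt format is invented for illustration.
def amplify(question, ask_model, depth=1):
    if depth == 0:
        return ask_model(question)
    sub_questions = ask_model(
        f"List the sub-questions needed to answer: {question}"
    ).splitlines()
    sub_answers = [amplify(q, ask_model, depth - 1) for q in sub_questions if q.strip()]
    return ask_model(
        "Answer the original question using these sub-answers.\n"
        f"Question: {question}\n"
        + "\n".join(f"- {q}: {a}" for q, a in zip(sub_questions, sub_answers))
    )
```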
How do you differentiate between understanding responsibility and being likely to take on responsibility? Empathising with other people who believe the risk is high vs actively working on minimising the risk? Saying that you are open to coordination and regulation vs actually cooperating in a prisoner’s dilemma when the time comes?
As a datapoint, SBF was the most vocal about being pro-regulation in the crypto space, fooling even regulators & many EAs, but when Kelsey Piper confronted him by DM on the issue, he clearly confessed to saying this only for PR, because “fuck regulations”.
[Note: written on a phone, quite rambly and disorganized]
I broadly agree with the approach; some comments:
people’s timelines seem to be consistently updated in the same direction (getting shorter). If one were to make a plan based on current evidence, I’d strongly suggest considering how one’s timelines might shrink further, given the risk of not having updated strongly enough in the past.
a lot of my conversations with aspiring AI safety researchers go something like “if timelines were so short I’d have basically no impact, that’s why I’m choosing to do a PhD” or “[specific timelines report] gives X% of TAI by YYYY anyway”. I believe people who choose to do research drastically underestimate the impact they could have in short-timelines worlds (esp. through under-explored non-research paths, like governance / outreach, etc.) and overestimate the probability of AI timelines reports being right.
as you said, it makes sense to consider plans that work in short timelines and improve things in medium/long timelines as well. Thus you might actually want to estimate the EV of a research policy for 2023-2027 (A), 2027-2032 (B), and 2032-2042 (C), where by policy I mean you either apply a strategy for A and update if no AGI by 2027, or apply a strategy for A+B and update in 2032, etc. (see the toy EV sketch after these bullets)
It also makes sense to consider who could help you with your plan. If you plan to work at Anthropic, OAI, Conjecture, etc., it seems that many people there take the 2027 scenario seriously, and teams there would be working on short-timelines agendas no matter what.
if you’d have 8x more impact in a long-timelines scenario than in a short-timelines one, but consider short timelines only 7x more likely, working as if long timelines were true would create a lot of cognitive dissonance, which could turn out to be counterproductive
if everyone were doing this and going into a PhD, the community would end up producing less research now, and therefore have less research for the ML community to interact with in the meantime. It would also reduce the amount of low-quality research, and admittedly, during a PhD one would also publish papers, which might be a better way to attract more academics to the field.
one should stress the importance of testing for personal fit early on. If you think you’d be a great researcher in 10 years but have never tried research, consider doing internships / publishing research before going through the grad school pipeline. Also, a PhD can be a lonely and unproductive path for many. Especially if the goal is to do AI safety research, test your fit for direct work as early as possible (alignment research is surprisingly more pre-paradigmatic than mainstream ML research)
meta: it seems like the collapse feature doesn’t work on mobile, and the table is hard to read (especially the first column)
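To make the EV-of-a-policy point above a bit more concrete, a toy calculation (all probabilities and impact numbers are made up for illustration; only the structure of the comparison matters):

```python
# Toy EV calculation for the "policy over time brackets" point above. All numbers
# (AGI-by-bracket probabilities and per-bracket impact of each policy) are made up.
p_agi = {"A (2023-2027)": 0.2, "B (2027-2032)": 0.3, "C (2032-2042)": 0.3}
# Remaining 0.2 = no AGI by 2042, ignored here for simplicity.

# Impact of each policy conditional on AGI arriving in a given bracket, e.g.
# "phd_then_research" has little impact if AGI arrives before the PhD is done.
impact = {
    "direct_work_now":   {"A (2023-2027)": 1.0, "B (2027-2032)": 1.0, "C (2032-2042)": 0.8},
    "phd_then_research": {"A (2023-2027)": 0.1, "B (2027-2032)": 0.8, "C (2032-2042)": 1.5},
}

for policy, impact_by_bracket in impact.items():
    ev = sum(p_agi[bracket] * impact_by_bracket[bracket] for bracket in p_agi)
    print(f"{policy}: EV = {ev:.2f}")
```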
That sounds right, thanks!
Fixed, thanks.
Use the dignity heuristic as reward shaping
“There’s another interpretation of this, which I think might be better where you can model people like AI_WAIFU as modeling timelines where we don’t win with literally zero value. That there is zero value whatsoever in timelines where we don’t win. And Eliezer, or people like me, are saying, ‘Actually, we should value them in proportion to how close to winning we got’. Because that is more healthy… It’s reward shaping! We should give ourselves partial reward for getting partially the way. He says that in the post, how we should give ourselves dignity points in proportion to how close we get.
And this is, in my opinion, a much psychologically healthier way to actually deal with the problem. This is how I reason about the problem. I expect to die. I expect this not to work out. But hell, I’m going to give it a good shot and I’m going to have a great time along the way. I’m going to spend time with great people. I’m going to spend time with my friends. We’re going to work on some really great problems. And if it doesn’t work out, it doesn’t work out. But hell, we’re going to die with some dignity. We’re going to go down swinging.”
Links for the audio: Spotify, Apple Podcast, Google Podcast