This is a dumb question but… is this market supposed to resolve positively if a misaligned AI takes over, achieves superintelligence, and then solves the problem for itself (and maybe shares it with some captive humans)? Or any broader extension of that scenario?
My timelines are not that short, but I do currently think basically all of the ways I expect this to resolve positively will very heavily rely on AI assistance, and so various shades of this question feel cruxy to me.
I honestly didn’t think of that at all when making the market, because I think takeover-capability-level AGI by 2028 is extremely unlikely.
I care about this market insofar as it tells us whether (people believe) this is a good research direction. So obviously it’s perfectly ok to resolve YES if it is solved and a lot of the work was done by AI assistants. If AI fooms and murders everyone before 2028, this is obviously a bad portent for this research agenda, because it means we didn’t get it done soon enough, and it’s little comfort if the ASI solves interp after murdering or subjugating all of us. So that would resolve N/A, or maybe NO (not that it will matter whether your mana is returned to you after you are dead). If we solve alignment without interpretability, live in the glorious transhumanist utopia before 2028, and only manage to solve interpretability after takeoff, then… idk, I think the best option is to resolve N/A, because we also don’t care about that when deciding today whether this is a good agenda.
Surely there are reasons to do ambitious interp other than achieving its stated goal? I doubt we will have a fully understandable model by 2028, but I still think the abstractions developed in the process will be helpful.
For instance, many of the higher-order methods like SAEs rest on assumptions about how activation space is structured. Studying smaller systems rigorously can give us ground truth about how models construct their activation space, which lets us question or modify those assumptions.
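To make concrete what kind of assumption I mean, here is a minimal, purely illustrative SAE sketch in PyTorch (not any particular published implementation): the architecture itself bakes in the assumption that activations are roughly sparse, non-negative linear combinations of an overcomplete set of feature directions, which is exactly the sort of thing ground-truth studies of small models could confirm or overturn.

```python
# Illustrative sketch only: the standard SAE recipe, showing where the
# structural assumptions about activation space live.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Plain linear maps: the "features are directions" assumption.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        # ReLU assumes non-negative feature activations.
        feats = torch.relu(self.encoder(acts))
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(acts, recon, feats, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty: the sparsity assumption.
    # If activation space is not actually built from sparse linear
    # features, this objective is the wrong lens.
    return ((acts - recon) ** 2).mean() + l1_coeff * feats.abs().mean()

# Hypothetical usage on random stand-in "activations":
# acts = torch.randn(64, 512)
# sae = SparseAutoencoder(d_model=512, d_features=2048)
# recon, feats = sae(acts)
# loss = sae_loss(acts, recon, feats)
```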
Unfortunately, prediction markets need some bright red line somewhere to be resolvable. I encourage you to make a different market that captures the thing you care about.
I don’t care about prediction markets.
But people who believe we aren’t going to be able to fully understand models frequently take this as a reason not to pursue ambitious/rigorous interpretability. I thought that was the position you were taking by using the market to decide whether the agenda is “good” or not.