Interesting work! I loved the nuclear engineering example. A couple of questions I had while reading:
I’m a bit confused about how much the SPH is tracking phase transitions per se. As I understand it, it seems like it’s measuring something like “when the loss across datapoints of interest starts moving together,” which seems like a pretty interesting and relevant thing to track, but not obviously a phase transition in the way I’m imagining it (i.e. a sharp discontinuity). Would you not see one eigenvalue beginning to trump the rest in cases where the model is gradually learning a skill? Intuitively, it seems like lowering the threshold at which you consider an eigenvalue to meaningfully increase above the others might account for the difference between gradual and sharp skill acquisition. But if not, is that a problem? I.e., if not all skills are well modeled as going through phase transitions, then do we need different early detection systems for them?
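To make this first question concrete, here is a toy sketch. It is not the SPH itself: I'm assuming the quantity of interest is something like the share of variance captured by the top eigenvalue of the covariance of per-datapoint losses across checkpoints, which may not match what the post computes. The function names, the loss curves, and the noise level are all made up for illustration.

```python
import math
import random


def simulate(shared, n_points=6, noise=0.05, seed=0):
    """Hypothetical loss curves: a shared trend plus independent per-datapoint noise.

    Returns one row per checkpoint and one column per probe datapoint.
    """
    rng = random.Random(seed)
    return [[s + rng.gauss(0.0, noise) for _ in range(n_points)] for s in shared]


def covariance(rows):
    """Sample covariance across columns (datapoints), treating rows as checkpoints."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


def top_eigenvalue_share(cov, iters=200):
    """Estimate (largest eigenvalue) / (trace) using plain power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector approximates the top eigenvalue.
    lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    trace = sum(cov[i][i] for i in range(d))
    return lam / trace


steps = 50
# Assumed shapes: a slow linear decline, a sharp sigmoid drop, and no shared trend.
gradual = [1.0 - t / (steps - 1) for t in range(steps)]
sharp = [1.0 / (1.0 + math.exp((t - steps / 2) / 1.5)) for t in range(steps)]
flat = [1.0] * steps

gradual_share = top_eigenvalue_share(covariance(simulate(gradual)))
sharp_share = top_eigenvalue_share(covariance(simulate(sharp)))
noise_share = top_eigenvalue_share(covariance(simulate(flat)))

print(f"gradual: {gradual_share:.2f}  sharp: {sharp_share:.2f}  noise only: {noise_share:.2f}")
```

Under these assumptions, both the gradual and the sharp curve produce one dominant eigenvalue over the full window, while the noise-only case does not. So a dominance threshold alone would not separate the two; you would presumably need to look at how quickly the share rises over a sliding window.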
I’m also interested in how the datapoints are chosen for the probe. This seems like one of the more important steps. Intuitively, to get a representative sample of the skill in question, it seems like you should choose a few different-ish datapoints. E.g., if you’re interested in scheming, maybe you choose ones about evaluation awareness, goal-seeking behavior, blackmailing, etc. But how do you know that this sampling of datapoints is representative enough to catch all instances of skills developing in the model related to scheming? Is there a more principled way of doing this?
I worry about this too. It seems perhaps unprecedented, the amount AI risk is a problem relative to the amount we know what to do about it. I’m not sure humanity has ever been in a position where these have diverged so much; it’s very disorienting and desperation-inducing. And I think this sometimes makes working on x-risk fraught, in ways like you describe. I find myself grasping sometimes, and I can feel the franticness trying to grab onto things—this agenda, this plan, this technique—in the hope that just maybe therein lies something which could assuage the fear.
When I first really grokked what death was as a child, I asked the adults around me about it, assuming that I was just confused or misinformed, since surely they would also be scared if they knew. But their reassurances were dissatisfying: “you won’t feel anything when you’re dead,” “it’s really far away, don’t worry about it.” And even back then I could tell that something strange was going on—these responses woven from flinches, expertly avoiding contact with the terror. And it made me feel more unsettled, this darker truth lurking beneath the surface. What else was there?
Perhaps because of this, or perhaps because of my constitution, I’ve always had a hard time not looking straight at terrifying things. It really disturbed me how easily people could flinch away from something so important, and so scary. How easily rationalizations formed, how easy it was to just pretend it wasn’t really there, to write it off with just-so stories. And then at some point as an adult I grokked x-risk, and I looked around at the arguments people made about it, and I had the same feeling I’d had as a child first grokking death. People were talking about it in a detached way, writing essays about how X evidence or Y technique counts as progress toward our survival. And I looked at these with the same feeling of heartbreak and dissatisfaction as before. These were the arguments which supposedly rendered the threat of extinction smaller, or barely a risk at all? And I couldn’t help wondering if the same generator lurked. That whatever it is which causes people to turn the terrifying into pleasant abstractions—small, far away, somehow fine—was also happening here.
It’s hard, I think, for people to take as input “terror” and “lack of a solution.” Or even worse, here: lack of knowing what a solution even looks like, or what seeming-steps toward it are steps at all. I don’t think this is a stable mental configuration for nearly anyone; minds begin to grasp for ways out: belief that the situation somehow isn’t as dire, or that some plan will surely work, etc. Which isn’t to say, necessarily, that those beliefs are wrong. But I worry they are! I worry that in our rush to do anything about a confusing, terrifying, ultimate threat, we have ended up with many plans in need of purpose. That when a group of people are desperate to do anything, and to believe they are doing anything, they create vacuums. New metrics of success rush in, this eagerness to claim the legible and the presently tractable as progress toward that solution. For there is only so long minds and groups and movements can sanely go on in the hope of one day succeeding at indeterminate aims. At some point, something tends to give.
This is one of the biggest dangers, to my mind, of working on x-risk. I think not getting swept away by these currents is incredibly difficult; I certainly struggle with it. In my early twenties, I often felt intense anxiety about death, and it took me a long time to figure out a better mental configuration: one which allowed me to accept the unacceptability of death without falling apart. It wasn’t easy, and I can see why most people don’t want to do it. But I worry that easier orientations often lead to blind spots which can end up becoming threats themselves. And while I don’t wish for people to experience such terror, I do wish they could somehow take these problems seriously: to look clearly at them, and to accept when they are confused, or do not know what to do. My sense is that work on x-risk often ends up counterproductive without this.