This line of thinking seems very plausible and important to me. Can you share a bit about how you arrived at it and can you recommend related material? Also, do you have ideas about how it could be empirically tested and refined?
roha
If our abstraction of good is contingent on specifics of the social primate developmental context, should we expect the abstraction of good in LLMs to be substantially different? If so, how could we find that out before handing over our fate to them? Is this the only abstraction where divergence would be a problem?
I think the plausibility might differ between different kinds of systems. For example: What if you replace car with computer operating system, and add that it has been hardened for a billion years by an optimization process and as one result is full of error-correcting/self-repair mechanisms? Does that change how plausible it is that observed issues mostly have external causes rather than the system breaking under its own normal operation?
Replace always with in the majority of cases and the idea seems fine. A complication is also that genetic problems we currently understand are mostly monogenic ones, while in the territory we should expect a lot of polygenic issues that we can’t put on the map yet.
I agree that the system can break by itself without a specific external cause, but from my own observations I also think that current medical practice very often accepts explanations that are too shallow and applies far-from-optimal treatments as a result. I like the look-until-the-cause-is-external heuristic, though it may be unachievable in many cases (e.g. polygenic diseases are poorly understood) and occasionally wrong when the cause is actually a random degradation inside the body.
I’m not familiar with it. I’d guess that a formally verified kernel would be a solid first step towards a secure operating system that even successor models of Mythos won’t be able to attack (sans hardware vulnerabilities that can be exploited by software and can’t be captured by a formal specification).
My thought process on securing hardware: If SOTA models can find obscure vulnerabilities in software as well as attack strategies that exploit one or several of them, I assume mankind cannot be far from having models that are able to discover novel hardware problems (e.g. something like GPUHammer) and utilize them, though the feedback loop for experimentation might be much trickier to set up than in the software case. If some of these new hardware flaws can’t be fixed by a firmware update or by disabling problematic functionality on critical infrastructure, then physical devices will need to be replaced, which in my model of the world should happen at a much slower pace than the writing and distribution of software patches. If defenders have an advantage by getting earlier model access, it could be negated if downstream fixes can’t arrive fast enough to outpace the attackers.
I agree that software vulnerabilities are not a law of nature but essentially a skill and resource issue. If mankind manages with the help of AI to create operating systems and applications without any exploitable bugs, which is at least a conceptual possibility, there’s still the hardware layer and the social layer that can be targeted. I think hardware can in principle be fixed as well, though at a slower pace that might give attackers a relevant advantage. I don’t think human users can possibly be fixed. So points 2 and 3 of the OP look to me like permanent issues we didn’t have before and won’t get rid of, i.e. an irreversible change of the game state. I suppose the larger issues will come in other fields though, where hardening potential is equally or more limited and potential damage is much larger, e.g. in biosecurity and autonomous weapon systems.
What would programming look like if writing tests could increase the chance of a bug appearing in the code, not just the chance to discover an existing bug? I guess it would depend on the precise mechanism and one would try to understand the linkage and decouple the two activities rather than attempting to minimize problems by getting rid of tests.
Related: How far ahead are the capabilities of unpublished or undisclosed models? I’d like to read estimates based on extrapolation from past observations. Is anybody aware of any?
My intuition is that 3.3 gaining more traction would be good for the world, because to me a) that seems most realistic based on the evidence I’ve seen so far and how I interpret it, and b) least problematic in case it’s wrong and we live in a world where alignment isn’t hard and organisations act competently. What reasons make your intuition point towards 3.2?
We now have multiple non-negligible vectors for causing widespread catastrophe and possible extinction, not just among humans, but against all life on the planet.
Is there any vector other than misaligned ASI that could kill all bacteria on the planet and end biological evolution? I couldn’t name one but perhaps there are indeed some I’m unaware of yet?
most countries have done little of substance in response, because their incentives to ignore the consensus outweigh the perceived risk
The perceived risk could shift sufficiently for substantive responses if mankind runs into a large enough accident without being disempowered or extinguished yet. Whether coordination would work under such a scenario is another question. It’s also not something to count on, to hope for and certainly not to strive for.
I think if its first attempt fails, it may get many subsequent ones, depending on how visible the previous ones were and how well it hedged its position. For example, if a pathogen didn’t work out as intended due to a sim-to-real gap, but we’ve not even detected it or where it came from, the ASI can try a different strategy. If we did notice it and tried to react to it in panic, the ASI may have long since exfiltrated itself to an unknown location/substrate and continue with another plan. In terms of the third historical analogy: If the Ardennes had actually stopped the advance, the Germans would still be there and would attempt another strategy (e.g. a direct assault on the Maginot line with novel technology, as @RedMan mentioned in another comment) that could still put France out.
In contrast, if our first attempt fails, we won’t get a second try with a different strategy.
Do you think advances in mechanistic interpretability can meaningfully reduce the probability of a failure during one or several critical tries, for example by detecting scheming, alignment faking, sandbagging, etc. in one or more involved models?
In the historical analogies of irrevocable failures, it seems that a better understanding of the one component that caused the failure could have meaningfully improved the chances of success (software update behavior, valve behavior, specific adversarial army capabilities). These were less cursed problems, and the component that needed more hardening wasn’t known beforehand, but if anybody had spent more hardening work on it, the failure could realistically have been prevented (and another failed example would have to be selected here instead).
I agree that 1) the term “oneshot” is quite overloaded with different meanings and 2) it is plausible that this contributes to some of the (initial) misunderstanding with audiences that often come in contact with another meaning than the intended one.
Fair, I should have been more precise about the placement of the “must”: Inside the if-then rule, not in the outer game description. The card frame and the bar frame differ in how the rule is expressed (not just here, also in the Wikipedia article), which I guess strongly influences how people parse it into a logical relationship in their mind.
(Besides: The bar scenario also comes with a prior understanding of how to parse the rule, since everyone is familiar with it and its exact meaning, while the card scenario does not and therefore has more room for a parsing failure.)
I’ve not seen this before and got it wrong while sitting in a noisy coffeehouse and giving it a short moment of thought. I’d have checked the correct cards plus the blue one.
I think the main reason for the performance difference between the two versions is that in the second version the rule is expressed using “need to” or “must”, which leads to parsing it clearly as a material conditional rather than as some other (confused) logical relation between the two statements.
I’d guess people would perform better on a slightly reformulated version: “Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face must be blue?”
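To make the material-conditional reading concrete, here is a minimal sketch that brute-forces which cards need turning. The card set (3, 8, blue, red) and the names `violates`, `must_turn` and `cards_to_turn` are my own assumptions for illustration; only the even-implies-blue rule comes from the task as quoted above. A card needs turning exactly when some possible hidden face would falsify the rule.

```python
from typing import List, Union

Face = Union[int, str]

# Assumed card faces for illustration; they are not stated in the comment.
NUMBERS = [3, 8]
COLORS = ["blue", "red"]


def violates(number: int, color: str) -> bool:
    """'even -> blue' read as a material conditional is false
    only when the number is even and the color is not blue."""
    return number % 2 == 0 and color != "blue"


def must_turn(visible: Face) -> bool:
    """A card must be turned iff some possible hidden face breaks the rule."""
    if isinstance(visible, int):
        # Number is showing, so the hidden face is a color.
        return any(violates(visible, color) for color in COLORS)
    # Color is showing, so the hidden face is a number.
    return any(violates(number, visible) for number in NUMBERS)


def cards_to_turn(cards: List[Face]) -> List[Face]:
    """Return the visible faces whose cards could falsify the rule."""
    return [card for card in cards if must_turn(card)]


print(cards_to_turn([3, 8, "blue", "red"]))  # [8, 'red']
```

Under these assumptions the blue card never needs checking: whatever number is on its other side, the rule cannot be falsified by it. That is exactly the extra card I would have turned.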
Thank you for the restraint it took to talk about drone swarms, which everyone can palpably understand, in contrast to more realistic scenarios, which only a fraction of people are willing to imagine and take seriously. It’s a bug, yes, but there’s no easy patch and trying to fix it is not the task that needs to be solved.
Thanks also for pointing out the optimists’ selective pessimism: ASI cannot possibly be banned without a worldwide tyranny, but ASI will surely be beneficial if we scale up whatever works first.
Not sure this is a coherent thought since I’m sleep deprived, but I’ll share it anyways before going to bed: Rather than predicting what the phantom’s planning module would do, would it be cheaper to reuse your own planning module for that prediction (assuming it has already converged closely enough towards the phantom’s)? For predicting other agents, you’d probably build a model of their planning module, but for predicting an agent similar enough to yourself, you might be able to reuse machinery rather than build yet another model?