Systems programmer, security researcher and tax law/policy enthusiast.
Dentosal
LLM would have said this better, and without all these typos too
Perhaps you should suspect me as well
Miscellaneous observations about board games
Omniscience one bit at a time: Chapter 3
Omniscience one bit at a time: Chapter 2
Omniscience one bit at a time: Chapter 1
The solution to akrasia apparently isn’t not having any goals
Is it really paranoia if I’m really Out to Get Me?
Anticheat: a non-technical look without psychoanalysis
Suffering is what makes it special
Parsing Validation
Just complaining about LLM sycophancy (filler episode)
Me consuming five different forms of media at once to minimize the chance of a thought occurring
Ink without haven
I’m drawing parallels between conventional system auditing and AI alignment assessment. I’m admittedly not sure if my intuitions transfer over correctly. I’m certainly not expecting the same processes to be followed here, but many of the principles should still hold.
We believe that these findings are largely but not entirely driven by the fact that this early snapshot had severe issues with deference to harmful system-prompt instructions. [..] This issue had not yet been mitigated as of the snapshot that they tested.
In my experience, if an audit finds lots of issues, it means nobody had time to look for the hard-to-find ones. I get the same feeling from this section: Apollo easily found scheming issues where the model deferred to the system prompt too much. Subtler issues often get completely shadowed, e.g. some findings could be attributed to system-prompt deference when in reality they were caused by something else.
To help reduce the risk of blind spots in our own assessment, we contracted with Apollo Research to assess an early snapshot for propensities and capabilities related to sabotage
What I’m worried about is that these potential blind spots were not found, as per my reasoning above. I think the marginal value produced by a second external assessment wasn’t diminished much by the first one. That said, I agree that deploying Claude 4 is quite unlikely to pose any catastrophic risks, especially with ASL-3 safeguards. Deploying earlier, so that anyone can run evaluations on the model, is also valuable.
Notes on Claude 4 System Card
You cannot incentivize people to make that sacrifice at anything close to the proper scale because people don’t want money that badly. How many hands would you amputate for $100,000?
There’s just no political will to do it, since the solutions would be harsh or expensive enough that nobody could impose them upon society. A god-emperor, who really wished to increase fertility numbers and could set laws freely without society revolting, could use some combination of these methods:
If you’re childless, or perhaps just unmarried, you pay additional taxes. The amount can be adjusted to be as high as necessary. Alternatively, just raise the general tax rate and give a reduction based on the number of children (a toy calculation after this list sketches the arithmetic). If having children meant more money instead of less, that would help quite a bit.
Legally mandate having children. In some countries, men are forced into military service; you could require women to have children in a similar way. Medical exceptions are already a thing for military service, and they could apply here as well.
Remove VAT and other taxes from daycare services, and medical services for children.
Offer free medical services to children. And parents. (And everyone.)
Spend lots of money and research how to create children in artificial wombs. Do that.
The state could handle child-rearing, similar to how it works in Plato’s Republic. That is, massively scale up the orphanage system and make it socially acceptable.
Fix the education system, while you’re at it.
Forbid porn, contraception, and abortion. (I don’t think that actually helps.)
Deny women access to education beyond elementary school, and additionally forbid employment. (Likely helps, but at what cost?)
Propaganda. Lots of it. Censorship as well.
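To make the tax lever above concrete, here is a toy sketch of a flat tax with a per-child rate reduction. All rates, incomes, and names are made up for illustration; this is not any real tax schedule, just the shape of the incentive.

```python
# Toy model of the "raise the general tax rate, give a reduction per child" lever.
# All rates and incomes here are invented for illustration only.

def tax_owed(income: float, children: int,
             base_rate: float = 0.45, cut_per_child: float = 0.10) -> float:
    """Flat tax at base_rate, reduced by cut_per_child per child, floored at 0%."""
    rate = max(base_rate - cut_per_child * children, 0.0)
    return income * rate

income = 40_000
for kids in range(4):
    print(f"children={kids}: tax={tax_owed(income, kids):,.0f}")
# children=0: tax=18,000
# children=1: tax=14,000
# children=2: tax=10,000
# children=3: tax=6,000
# Each child is worth 4,000 per year in reduced tax here; crank cut_per_child
# high enough and having children flips from a net cost to a net gain.
```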
Communication is indeed hard, and it’s certainly possible that this isn’t intentional. On the other hand, a mistake is quite suspicious when it’s also useful for your agenda. But I agree that we probably shouldn’t read too much into it. The system card doesn’t even mention the possibility of the model acting maliciously, so maybe that’s simply not in scope for it?
While reading the OpenAI Operator System Card, I found the following paragraph on page 5 a bit weird:
We found it fruitful to think in terms of misaligned actors, where:
the user might be misaligned (the user asks for a harmful task),
the model might be misaligned (the model makes a harmful mistake), or
the website might be misaligned (the website is adversarial in some way).
Interesting use of language here. I can understand calling the user or website misaligned, if alignment is understood relative to laws or OpenAI’s goals. But why call a model misaligned when it makes a mistake? To me, misalignment would mean doing that on purpose.
Later, the same phenomenon is described like this:
The second category of harm is if the model mistakenly takes some action misaligned with the user’s intent, and that action causes some harm to the user or others.
Is this yet another attempt to erode the meaning of “alignment”?
An astute reader pointed out that the Clueless designation might make more sense if we consider ChatGPT inferior instead. I hadn’t considered that option, and it makes much more sense.