If anyone wants to have a voice chat with me about a topic that I’m interested in (see my recent post/comment history to get a sense), please contact me via PM.
My main “claims to fame”:
Created the first general purpose open source cryptography programming library (Crypto++, 1995), motivated by AI risk and what’s now called “defensive acceleration”.
Published one of the first descriptions of a cryptocurrency based on a distributed public ledger (b-money, 1998), predating Bitcoin.
Proposed UDT (Updateless Decision Theory), combining the ideas of updatelessness, policy selection, and evaluating consequences using logical conditionals.
First to argue for pausing AI development based on the technical difficulty of ensuring AI x-safety (SL4 2004, LW 2011).
Identified current and future philosophical difficulties as core AI x-safety bottlenecks, potentially insurmountable by human researchers, and advocated for research into metaphilosophy and AI philosophical competence as possible solutions.
It confused me that Opus 4.6's System Card claimed less verbalized evaluation awareness than 4.5, because I had never heard of Opus 4.5 being too evaluation-aware to evaluate. It turns out that Apollo simply wasn't part of Opus 4.5's alignment evaluation (4.5's System Card doesn't mention them).
This probably seems unfair or unfortunate from Anthropic's perspective: they believe their models are becoming less evaluation-aware, but because Apollo's conclusions spread on social media, many people probably came away with the impression that models are getting more evaluation-aware. Personally, I'm not sure we can trust Anthropic's verbalized evaluation awareness metric, and I wish Apollo had run evals on 4.5 as well, to give us an external point of comparison.