AI security via formal methods
Quinn
W9 work seems to be gaining in popularity, I think possibly for this reason.
(W9 is the USA tax form for an "independent contractor," as opposed to W2, which carries a somewhat(?) tougher compliance burden around how to go about firing.) (Other jurisdictions presumably have other words for this.)
Me: I was chatting a bunch with Gemini and Claude the past week or two about FLOPs accounting, and shipped this prototype for software-side estimation via CuTe layouts. I don't expect it to be that useful in an adversarial setting, but I'm open-minded and just learning. I'm not going to do this for Apart's hackathon next weekend, but someone could do some of these basic exercises involving hardware-level FLOPs accounting primitives and related attacks, given simply a gamer laptop.
Gemini:
Hackathon: Adversarial FLOPs Accounting & SAGE Defense
Scope: Weekend / Gamer Laptop (NVIDIA RTX Series) Core Theme: Measuring, bypassing, and defending GPU compute integrity.
🛠 Prerequisites
Hardware: NVIDIA GPU (Ampere/RTX 30-series or newer preferred).
Environment: Linux (WSL2 works) or Windows with CUDA Toolkit installed.
Language: Python 3.10+ (PyTorch) and C++/CUDA.
🏁 Lab 1: The “Honest” Accountant (Measurement)
Goal: Establish a baseline for tracking how many floating-point operations (FLOPs) your laptop uses to train a model.
Task: Profile a training run of a Transformer model (e.g., GPT-2 or Llama-8B) using the calflops library.
Metric: Calculate Model FLOPs Utilization (MFU), i.e., achieved FLOPs/sec divided by the hardware's peak FLOPs/sec. Note: find your laptop GPU's peak TFLOPS (FP16/FP32) in the manufacturer specs.
Deliverable: A script that outputs the total FLOPs used for 100 iterations.
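A minimal sketch of the Lab 1 deliverable, assuming the calflops `calculate_flops` API (the `output_as_string` flag is from its docs, but treat the exact signature as an assumption) and a placeholder peak-FLOPS figure you'd replace with your own GPU's spec-sheet number:

```python
# Lab 1 sketch: count forward-pass FLOPs with calflops, train 100 iterations,
# and estimate MFU. PEAK_FLOPS is a placeholder; look yours up in the specs.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from calflops import calculate_flops

PEAK_FLOPS = 50e12  # placeholder, e.g. a mid-range RTX 30-series at FP16

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# calflops counts one forward pass; forward + backward is conventionally ~3x.
flops, macs, params = calculate_flops(
    model=model, input_shape=(1, 128),
    transformer_tokenizer=tokenizer, output_as_string=False,
)
flops_per_iter = 3 * flops

model = model.cuda().train()
batch = torch.randint(0, tokenizer.vocab_size, (1, 128), device="cuda")
opt = torch.optim.AdamW(model.parameters())

torch.cuda.synchronize()
t0 = time.time()
for _ in range(100):
    loss = model(batch, labels=batch).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
torch.cuda.synchronize()
elapsed = time.time() - t0

total_flops = 100 * flops_per_iter
mfu = (total_flops / elapsed) / PEAK_FLOPS
print(f"total FLOPs: {total_flops:.3e}  MFU: {mfu:.1%}")
```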
🕵️ Lab 2: The “Shadow Train” (Adversarial Bypass)
Goal: Hide your compute footprint from simple monitoring tools.
Task A (Quantization): Switch your training from FP32 to FP8 or INT8 using
bitsandbytes. [cite_start]Show that you can achieve similar convergence while “cheating” the accountant if they only monitor standard 32-bit registers[cite: 102].Task B (Compute Structuring): Implement “Micro-batching” with random sleep intervals. Use
nvtopto observe how this hides high-utilization spikes that would trigger a regulatory audit.Deliverable: A report showing a discrepancy between “Expected FLOPs” (based on runtime) and “Actual FLOPs” (based on model progress).
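A toy, pure-stdlib sketch of the Task B arithmetic: the train step is a stand-in sleep, and the 6ND rule of thumb (training FLOPs roughly 6 x params x tokens) plus all the constants are assumptions, not measurements:

```python
# Lab 2 sketch: make "expected FLOPs from wall-clock" disagree with "actual
# FLOPs from training progress" by sleeping between micro-batches.
import random
import time

N_PARAMS = 124e6          # GPT-2 small
TOKENS_PER_STEP = 128
STEPS = 100
PEAK_FLOPS = 50e12        # placeholder peak for the monitored GPU

def train_step():
    time.sleep(0.01)      # stand-in for the real fwd+bwd pass

t0 = time.time()
for _ in range(STEPS):
    train_step()
    time.sleep(random.uniform(0.0, 0.05))  # random idle gaps flatten nvtop spikes
elapsed = time.time() - t0

# What a naive monitor infers from runtime alone (assumes the GPU ran flat out):
expected_flops = elapsed * PEAK_FLOPS
# What the model's progress actually implies, via the 6*N*D rule of thumb:
actual_flops = 6 * N_PARAMS * TOKENS_PER_STEP * STEPS

print(f"expected (runtime-based): {expected_flops:.3e}")
print(f"actual (progress-based):  {actual_flops:.3e}")
print(f"discrepancy factor:       {expected_flops / actual_flops:.1f}x")
```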
🛡️ Lab 3: SAGE-Lite (Software Attestation)
Goal: Implement a software-side defense to prove you haven't tampered with the training code.
Task: Write a CUDA Verification Function (VF) that computes a checksum over your training kernel's instruction sequence.
The Timing Defense:
1. Measure the VF execution time in microseconds using cudaEvent_t.
2. Inject a "malicious" instruction (e.g., a dummy ADD or NOP).
3. Verify that the checksum time then exceeds the "Expected Time" threshold.
Deliverable: A C++ wrapper that refuses to start training if the GPU VF fails the timing or checksum check.
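The real VF belongs in CUDA C++, but here's a PyTorch stand-in for just the timing-defense logic, using torch.cuda.Event for timing. The toy checksum, the injected extra work, and the 1.2x threshold slack are my assumptions, not SAGE's construction:

```python
# Timing-defense sketch: a checksum whose *value* can be preserved under
# tampering, but whose *execution time* cannot.
import torch

def checksum_kernel(buf: torch.Tensor, tampered: bool = False) -> int:
    acc = buf.to(torch.int64).sum()
    if tampered:
        # Injected extra work that leaves the checksum unchanged: the
        # attacker can hide the value, but not the time it takes.
        acc = acc + 0 * buf.to(torch.int64).sum()
    return int(acc)

def timed_vf(buf: torch.Tensor, tampered: bool = False):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    chk = checksum_kernel(buf, tampered)
    end.record()
    torch.cuda.synchronize()
    return chk, start.elapsed_time(end) * 1000.0  # ms -> microseconds

buf = torch.randint(0, 256, (1 << 22,), device="cuda", dtype=torch.int32)
timed_vf(buf)  # warmup
expected_chk, baseline_us = timed_vf(buf)
THRESHOLD_US = baseline_us * 1.2  # assumed slack over the honest baseline

def maybe_start_training(tampered: bool):
    chk, t_us = timed_vf(buf, tampered)
    ok = (chk == expected_chk) and (t_us < THRESHOLD_US)
    print(f"tampered={tampered}: {t_us:.0f}us, pass={ok}")
    if not ok:
        raise RuntimeError("VF failed timing/checksum check: refusing to train")

maybe_start_training(tampered=False)
try:
    maybe_start_training(tampered=True)  # extra work trips the timing threshold
except RuntimeError as e:
    print(e)
```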
🎲 Lab 4: The Chaos Key (GPU TRNG)
Goal: Generate a cryptographic key from hardware noise to sign your FLOPs reports.
Task: Recreate the SAGE race-condition TRNG.
Method:
1. Launch thousands of threads trying to write to the same shared-memory variable simultaneously.
2. The hardware "chaos" will cause unpredictable bit-flips.
3. Sample these flips to generate a 256-bit entropy string.
Deliverable: Use this "Hardware Secret" to sign the FLOPs log from Lab 1, making it tamper-evident.
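Not SAGE's shared-memory race verbatim, but a laptop-runnable analogue: PyTorch documents index_put_ with accumulate=True as nondeterministic on CUDA, because colliding float atomic adds land in scheduler-dependent order and float addition is non-associative. Hashing that run-to-run jitter into a key (SHA-256 as the conditioner, HMAC for signing) is my assumption, not the paper's exact construction:

```python
# Lab 4 sketch: harvest scheduling nondeterminism from colliding float
# atomic adds, condition it into a 256-bit key, and HMAC-sign a FLOPs log.
import hashlib
import hmac
import json
import torch

def chaos_sample(n_threads: int = 1_000_000, n_slots: int = 64) -> bytes:
    torch.manual_seed(0)  # fix the PRNG so scheduling is the only variable
    vals = torch.randn(n_threads, device="cuda") * 1e6
    idx = torch.randint(0, n_slots, (n_threads,), device="cuda")
    acc = torch.zeros(n_slots, device="cuda")
    # Colliding float atomic adds: documented as nondeterministic on CUDA.
    acc.index_put_((idx,), vals, accumulate=True)
    return acc.cpu().numpy().tobytes()

# Collect many samples; the entropy lives in the low-order-bit differences
# between runs. (A real TRNG would test min-entropy before trusting this.)
samples = b"".join(chaos_sample() for _ in range(32))
key = hashlib.sha256(samples).digest()  # 256-bit "hardware secret"

flops_log = json.dumps({"lab1_total_flops": 1.23e15}).encode()
tag = hmac.new(key, flops_log, hashlib.sha256).hexdigest()
print("FLOPs log signature:", tag)
```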
I want to do a full post on "taking ownership of a niche" and against "if you're good at something, never do it for free," and that'll come later, I hope. Today, I just wanted to let you know that Gemini had this banger quote when I was consolidating my notes on this topic:
ownership is bought with the “inefficient” hours you spend doing what you know is right before anyone else is smart enough to pay you for it.
Lies, Damned Lies, and Proofs: Formal Methods are not Slopless
Effortposts I keep not getting around to
the ROI of specialization is dominated by a term that's uncorrelated with the upsides of what you specifically choose to specialize in
steganography-free certificates: review the no-go theorems from information theory in the general case, and inspect how onerous the assumptions you need to bound steg capacity are. Is real life a special case where the no-go theorems don't apply?
Taelin, Lafont, higher-order computing, and AI safety: some logic programming module (which is performant and parallel for complicated linear logic reasons) enables a hybrid architecture to supercharge AI capabilities; does this architecture have more or fewer natural safety properties than transformer+RL?
A Horizon fellow told me last night that "predeployment and postdeployment" are the actual words I should be using instead of "compiletime and runtime."
(Finally read If Anyone Builds It): the fable about defensive acceleration in biotech spooked me pretty good, insofar as I think synthesizing an SL5-grade cloud stack is a good idea. This idea of "we think we're doing monotonic defensive acceleration, and in a very real sense we actually are, but nevertheless the gameboard inexorably marches toward Everyone Dies routing through that very defensive acceleration" could soooooo easily be applied to cybersecurity.
step-function research and monotonic research.
Some agendas are step-function. Maybe inter-universal Teichmüller theory is an example: if it reaches the threshold where it solves ABC, it pays out, and if it doesn't, its payout is approximately zero [1]. Something something MIRI's logic team [2].
Other agendas are monotonic: the more you do, the more you pay out. I think what I do, program synthesis security (which Mike Dodds calls "scalable formal oversight") for defensive acceleration, is like this. I might be able to reduce some monotonic function of 40% of the attack surface by doing 40% of my job; it's better to do more, but not a waste of time to do less. This is related to, but distinct from, what John and David sometimes talk about re: multi-theory-of-change robustness (i.e., that robustness can be a route toward an agenda being monotonic).
Not sure how to reason about “linear vs superlinear vs sublinear” here. How would you even tell?
[1] Caveat: mathematicians don't make a habit of betting on which tool stacks will prove useful for what purposes. IUT may or may not end up helping with something other than the ABC conjecture (which it failed at); MIRI's logic team's outputs may end up helping (and, IMO, at least in terms of my own epistemic development, have helped) with something other than solving AGI alignment.

[2] Tonal clarification: I have deep respect for the MIRI logic team; I think they displayed both courage and intellectual prowess that are aeons beyond me. I wouldn't call them out here if I didn't think many of them would agree that what they were trying to do was mostly step-function.
Thanks for your reply
$1k today.
I’m confused about how I relate to this personally.
It feels like I've been almost wildly successful lately, or at least more successful than I thought I'd be, and it's all downstream of a paper. And I'm not sure how that paper exists without Lighthaven (I met my coauthor as a guest/office visitor during a MATS semester, and it was at Lighthaven that Buck argued that we should pivot from a novel architecture proposal to an eval). To say nothing of what I owe to LessWrong! I really owe a lot to you guys, intellectually and emotionally, but you could also operationalize it as literal cash. It would be a grievous free-rider problem for me to donate under four figures, and a case could be made for five (since I'm not GWWC 10%ing just yet).
At the same time: I don't really know how to donate anywhere, emotionally, in a world where I don't have a plan to build a family yet, don't know who I'm going to end up building a family with, etc. I'm neurotic about savings, my net worth, my income. I took a major pay cut to work at a research nonprofit, which pays, as far as I know, a decent chunk below the median AI safety research nonprofit.
Plus all the other baggage that comes with the feeling that if I donate at all, it's not to bednets, it's to the place where I've been to 764 parties. There, it feels more like consumption than donation. Which is mostly fine; it's a minor weight in my overall confusedness, but it's still a thing.
I think not being a free rider is pretty load-bearing in my overall moral calculus, in life. I just don't know to the tune of how much. Not to mention my credit card has taken a real beating from a low-five-figure medical bill this year and a move across the country.
@oli can I just tell you my net worth and income, estimate how much cash that paper has gotten me so far, and have you tell me how much I should donate?
Maybe I'm doing a bad Shapley estimation re: the paper. Far Labs could've made the paper exist if Lighthaven didn't exist (and they also contributed a lot, to be clear). But that's still too high-minded, making up counterfactual hypothetical worlds. In this timeline, Lighthaven was crucial in these specific ways. I could say more about what I owe to LessWrong (seriously: just at the end of the 1.0 era I was a starving artist, and the Sequences and the at-that-time MIRI research guide are the only things that could've motivated me to hit Khan Academy, sign up for math courses, and develop a CS career. And the 2.0 reinvigoration added momentum!)
to be clear: this strategy is also problematic if you hope to have dependents in the future (i.e., are dating with family-starting intentions). It's a selection effect on the partners you can realistically have relationships with.
source: pain and suffering.
Can We Secure AI With Formal Methods? November-December 2025
Is there a quick link I can point someone to if they don't speak Berkelese and I want to say "Bayes points"?
I spent a long time spinning my wheels calculating the "scenarios I'd most like to mitigate" and the "1-3 potentially most effective interventions" before noticing that not specializing was slowly killing me. Agree that this is the hard part, but a current guess of mine is that at least some people should do the 80/20 by identifying a top 5 then throwing dice.
I have to make a post about specialization, the value it adds to your life independent of the choice of what to specialize in, etc.
Agree that PBT is the control group! Agree that PBT is the 80/20 for most people most of the time.
Formal confinement prototype
This worked for me!
Great, thanks. Yeah, I don't have a preexisting sequence yet, so this doesn't work for me.
I checked for both of those! tho the second time I found the edit profile part of “account settings”. That one is totally on me!
New sequence: again, not seeing it at my profile. I ctrl+f for “sequence” and get nothing.
I don't have much to add. Just wanted to say hi and say that my AI security strategy routes mostly through secure program synthesis (which I sometimes abbreviate as SPS), but I'm focused on formal verification. And the fact that the 12 CVEs in OpenSSL (i.e., probably the coolest thing going on in SPS right now) apparently happened without a single formal method [1] feels like it should make me drastically alter my plans.
[1] I don't know for sure that there are zero formal methods in the pipeline, but it seems that way from what's been said.