AI security via formal methods
Quinn
Apr-May 2026 AI Security via Formal Methods
given all the known unknowns and unknown unknowns around this stuff, plausible enough to me that anthropic RSI techs are in the same bind.
Like yes based on public information the story that openai RSI techs are in this bind is much cleaner, but I expect like 80-90% of the relevant knowledge to be private.
GDM, idk—they seem less RSI pilled than the other two (but again, what do I know)
is “bad writing” a meatcert?
I feel like half the reasons people tell me they like my newsletter is that I basically write how I talk, which is a voice, which is a “meat certificate” (reasons my readers believe I’m not using AI too much), which has gotta be an important part of my relationship with readers since its dead easy to make a slop current events newsletter anymore.
The thing is—I have memories dating back to the beforetimes of people telling me my verbal tics were “bad writing”. Often people line edit my writing to turn it into “good writing” and it hobbles the voice. Slop itself and especially our reactions to it are kinda proof that “good writing” in its extremes is grotesque or offputting.
I’m wondering how far up this goes.
the effortpost version of this would have examples, i’m jotting down a quick note instead
My answer is that I think CAIS-like scenarios necessarily precede agents-running-amok scenarios, and furthermore
I wonder how to grade this november 2022 forecast? (CAIS as in comprehensive AI services). it seems like i got it really bad, if you compare the openclaw pandora’s box to some of the stuff i recall in the CAIS document about narrow products proliferation?
Apart has a Secure Program Synthesis fellowship coming up, mentor application deadline is May 5th
https://apartresearch.com/fellowships/the-secure-program-synthesis-fellowship
Focus Areas
Specification Elicitation: Develop tools and workflows for extracting formal specifications from ambiguous, distributed, or implicit sources (e.g., documentation, legacy systems, human stakeholders). Projects may include structured editors, GUIs, or pipelines that translate informal requirements into formal representations (e.g., Lean), building on approaches like “SpecIDE.”
Specification validation: Design methods to verify that extracted specifications are correct and complete. This includes techniques for testing, cross-checking, or formally validating whether a specification accurately captures intended system behavior.
Spec-Driven Development & Evaluation: Explore workflows where specifications generate multiple candidate implementations, which are then evaluated against each other. Projects may involve building infrastructure to compare robustness, correctness, or performance across implementations derived from the same spec. For further reading, see: Approximately Aligned Decoding, Zero-DoF Programming
Adversarial Robustness for FM & QA Tools: Investigate failure modes in formal methods pipelines, LLM-assisted tooling, and QA systems. Projects may focus on adversarial inputs, robustness guarantees, or identifying weaknesses in automated reasoning and verification systems.
...
What’s in it for the mentors?:
A team executing your research vision. You propose the project; an Apart project manager and mentees run it with you. You’re directing the research, not running the project solo.
A dedicated project manager. An Apart Research Project Manager (RPM) handles the operational side of running a research team, so that you can focus on the research direction.
A working trial of potential hires. Historically, organizations running programs like this have used them functionally as work trials. Over four months you’ll work closely with mentees who’ve already passed our bar, giving you a clear read on who might be a fit for your team.
Project resources for the team: compute, and API credits.
Possibly a stipend. By default mentor roles aren’t paid, but if compensation would enable your participation, indicate this in the application form.
please email any questions to
secure-program-synthesis-fellowship@apartresearch.comthere will be an associated hackathon late may
[exploding note] Apply to Mentor Secure Program Synthesis Fellowship by May 5th
EDIT/update: i just remembered a key detail of the trial which is that one of the reasons we won was a key prosecution witness (who said things which, to the best of my knowledge, were largely made up) got character-assassinated (the defense found evidence of mob ties) and that’s part of why his testimony got thrown out. So character assassination played a role there, just not directly about the defendants.
jumping from press releases by lawyers to how criminal trials work is a jump. Clearly, inside the courtroom could be different from the press releases!
I was a defendant in a highly politicized free speech case and the lawyers on both sides made a couple gestures at character but the judge ignored them. overall the process was razor focused on establishing timelines and facts.
Yes obviously before and after the press releases focused on morality and character, which wasn’t really the crux of the outcome.
sometimes i wonder if the extent to which you’re a live player is your proximity to a frontier company. Feels that way sometimes! I think there’s a third thing he could mean, which is that the orgs that aren’t anthropic or close partners aren’t live players [1] .
(note: even if interacting with frontier companies is a reliable way to increase your magnitude, that doesn’t mean you’d be able to reliably predict the sign).
- ↩︎
seems false in governance, awareness/public opinion/forecasting, policy. Pretty plausible for technical work though
- ↩︎
Can We Secure AI With Formal Methods? January-March 2026
i’m pretty confident that a lot of jobs will be resistant to robots for a while, i expect there to be diminishing returns on raw cognitive skill when there’s the long feedback loops of robot invention and manufacture. Lots of people have to fix a mechanical widget in a very dusty or muddy or wet environment, people scuba-weld, etc.
What I’m confused about is will those wages, if the white collar wages go to zero, go up or down? I expect them to be better off than 90% of SWEs, but also society at large is getting a lot richer, and more people will be willing to do those dirty jobs. Lots of factors! I don’t know what to think.
can someone explain why Simulator theory was written about end of 2022, people semi forgot or moved on, then Personas made a big splash recently?
I don’t have much to add. Just wanted to say hi and say that my AI security strategy routes mostly through secure program synthesis (which I sometimes abbreviate as SPS), but I’m focused on formal verification. And the fact that it appears 12 CVEs in OpenSSL (i.e., probably the coolest thing going on in SPS right now) happened without a single formal method [1] feels like it should make me drastically alter my plans.
- ↩︎
i don’t know for sure that there is zero “formal methods” in the pipeline, but it seems that way from what’s been said.
- ↩︎
W9 work seems to be gaining in popularity, I think possibly for this reason.
(W9 is the USA tax form for “independent contractor”, as opposed to W2 which has a slightly(?) tougher compliance burden about how to go about firing) (there are other words for this in other jurisdictions, probably?)
Me: was chatting a bunch with Gemini and Claude the past week or two about flops accounting, and shipped this prototype for software-side estimation via the CuTe layouts that i don’t expect to be that useful in an adversarial setting but i’m open minded and just learning. I’m not gonna do this for Apart’s hackathon next weekend, but someone could do some of these basic exercises involving hardware-level flops accounting primitives and related attacks, given simply a gamer laptop.
Gemini:
Hackathon: Adversarial FLOPs Accounting & SAGE Defense
Scope: Weekend / Gamer Laptop (NVIDIA RTX Series) Core Theme: Measuring, bypassing, and defending GPU compute integrity.
🛠 Prerequisites
Hardware: NVIDIA GPU (Ampere/RTX 30-series or newer preferred).
Environment: Linux (WSL2 works) or Windows with CUDA Toolkit installed.
Language: Python 3.10+ (PyTorch) and C++/CUDA.
🏁 Lab 1: The “Honest” Accountant (Measurement)
Goal: Establish a baseline for tracking how many floating-point operations (FLOPs) your laptop uses to train a model.
Task: Profile a training run of a Transformer model (e.g., GPT-2 or Llama-8B) using the
calflopslibrary.Metric: Calculate Model FLOPs Utilization (MFU): Note: Find your laptop GPU’s peak TFLOPS (FP16/FP32) in the manufacturer specs.
Deliverable: A script that output the total FLOPs used for 100 iterations.
🕵️ Lab 2: The “Shadow Train” (Adversarial Bypass)
Goal: Hide your compute footprint from simple monitoring tools.
Task A (Quantization): Switch your training from FP32 to FP8 or INT8 using
bitsandbytes. [cite_start]Show that you can achieve similar convergence while “cheating” the accountant if they only monitor standard 32-bit registers[cite: 102].Task B (Compute Structuring): Implement “Micro-batching” with random sleep intervals. Use
nvtopto observe how this hides high-utilization spikes that would trigger a regulatory audit.Deliverable: A report showing a discrepancy between “Expected FLOPs” (based on runtime) and “Actual FLOPs” (based on model progress).
🛡️ Lab 3: SAGE-Lite (Software Attestation)
[cite_start]Goal: Implement a software-side defense to prove you haven’t tampered with the training code[cite: 32, 34].
[cite_start]Task: Write a CUDA Verification Function (VF) that computes a checksum over your training kernel’s instruction sequence[cite: 217, 221].
[cite_start]The Timing Defense: 1. Measure the VF execution time in microseconds using
cudaEvent_t[cite: 186]. 2. [cite_start]Inject a “Malicious” instruction (e.g., a dummyADDorNOP)[cite: 461]. 3. [cite_start]Verify that the checksum time exceeds the “Expected Time” threshold[cite: 189, 462].Deliverable: A C++ wrapper that refuses to start training if the GPU VF fails the timing or checksum check.
🎲 Lab 4: The Chaos Key (GPU TRNG)
[cite_start]Goal: Generate a cryptographic key from hardware noise to sign your FLOPs reports[cite: 61, 397].
[cite_start]Task: Recreate the SAGE Race-Condition TRNG[cite: 67].
[cite_start]Method: 1. Launch 1000s of threads trying to write to the same shared memory variable simultaneously[cite: 400, 401]. 2. [cite_start]The hardware “chaos” will cause unpredictable bit-flips[cite: 402, 403]. 3. [cite_start]Sample these flips to generate a 256-bit entropy string[cite: 405].
Deliverable: Use this “Hardware Secret” to sign the FLOPs log from Lab 1, making it tamper-evident.
I want to do a full post on “taking ownership of a niche” and against “if you’re good at something never do it for free”, and that’ll come later I hope. Today, I just wanted to let you know that Gemini had this banger quote when I was consolidating my notes on this topic:
ownership is bought with the “inefficient” hours you spend doing what you know is right before anyone else is smart enough to pay you for it.
Lies, Damned Lies, and Proofs: Formal Methods are not Slopless
Effortposts I keep not getting around to
the ROI of specialization is dominated by a term that’s uncorrelated with the upsides what you specifically choose to specialize in
steganography-free certificates: review no-go theorems from information theory in the general case, inspect how onerous the assumptions you need to bound steg capacity are. Is real life a special case where the no-go theorems don’t apply?
taelin, lafont, higher order computing, and AI safety. some logic programming module (which is performant and parallel for complicated linear logic reasons) enables a hybrid architecture to supercharge AI capabilities, does this architecture have greater or fewer natural safety properties than transformer+RL?
A horizon fellow told me last night that “predeployment and postdeployment” are the actual words i should be using instead of compiletime and runtime.
just a note: i think this already exists, its called a Google Doc. My last couple grants were shopped around to a few funders, i would make a copy and change the title to reference the target funder when i’d send it around (but in one case a funder received the “Manifund draft” version of the gdoc and funded it anyway)