Quinn
I was thinking the same thing this morning! My main thought was, “this is a trap. ain’t no way I’m pressing a big red button especially not so near to petrov day”
GSAI paper club is tomorrow (gcal ticket), summary (by me) and discussion of this paper
Alas, belief is easier than disbelief; we believe instinctively, but disbelief requires a conscious effort.
Yes, but this is one of the things I have felt mutate as I read the sequences and continued to hang out with you lot (starting roughly 8 years ago, with some off and on)
By all means. Happy for that
Note in August 2024 GSAI newsletter: see the Andrew Dickson section of the newsletter below.
august 2024 guaranteed safe ai newsletter
in case i forgot last month, here’s a link to july
A wager you say
One proof of concept for the GSAI stack would be a well-understood mechanical engineering domain automated to the next level and certified to boot. How about locks? Needs a model of basic physics, terms in some logic for all the parts and how they compose, and some test harnesses that simulate an adversary. Can you design and manufacture a provably unpickable lock?
Zac Hatfield-Dodds (of hypothesis/pytest and Anthropic, who was offered and declined authorship on the GSAI position paper) challenged Ben Goldhaber to a bet after Ben coauthored a post with Steve Omohundro. It seems to resolve in 2026 or 2027; the comment thread should get cleared up once Ben gets back from Burning Man. The arbiter is Raemon from LessWrong.
Zac says you can’t get a provably unpickable lock on this timeline. Zac gave (up to) 10:1 odds, so recall that the bet can have positive expected value for Ben even if he thinks the event is most likely not going to happen: at 10:1, Ben only needs to put a bit over 9% (one in eleven) on the lock showing up for the bet to be worth taking.
For funsies, let’s map out one path of what has to happen for Zac to pay Ben $10k. This is not the canonical path, but it is a path:
Physics to the relevant granularity (question: can human lockpicks leverage sub-Newtonian issues?) is conceptually placed into type theory or some calculus. I tried a Riemann integral in Coq once (way back), so it occurs to me that you need to decide whether you want just the functional models (perhaps without computation / with proof irrelevance) in your proof stack or whether you want the actual numerical analysis support in there as well.
Good tooling, library support, etc. around that conceptual work (call it mechlib) to provide mechanical engineering primitives
A lock designing toolkit, depending on mechlib, is developed
Someone (e.g. a large language model) is really good at programming in the lock designing toolkit. They come up with a spec L.
You state the problem “forall t : trajectories through our physics simulation, if L(t) == open(L) then t == key(L)” (a hedged Lean-flavored sketch of this shape follows the list)
Then you get to write a nasty gazillion line Lean proof
Manufacture a lock (did I mention that the design toolkit has links to actual manufacturing stacks?)
Bring a bunch to DefCon 2027 and send another to the lockpicking lawyer
Everyone fails. Except Ben and the army of postdocs that $9,999 can buy.
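For concreteness, here’s a minimal Lean-flavored sketch of the shape that problem statement could take. Every name in it (Trajectory, LockDesign, opens, usedKey) is invented for illustration; the real versions would come out of mechlib and the physics model.

```lean
-- Invented names throughout: this is only the shape of the unpickability
-- statement, not a real mechlib API.
variable (Trajectory LockDesign : Type)
variable (opens   : LockDesign → Trajectory → Prop)  -- the lock ends up open on t
variable (usedKey : LockDesign → Trajectory → Prop)  -- t actually used the key

/-- "Unpickable": every trajectory through the physics model that opens the lock
must have used the key. -/
def Unpickable (L : LockDesign) : Prop :=
  ∀ t : Trajectory, opens L t → usedKey L t
```

The nasty gazillion-line Lean proof is then a term of type `Unpickable L` for the manufactured design `L`.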
Looks like after the magnificent research engineering in steps 1 and 2, the rest is just showing off and justifying those two steps. Of course, in a world where we have steps 1 and 2 we have a great deal of transformative applications of formal modeling and verification just in reach, and we’ll need a PoC like locks to practice and concretize the workflow.
Cryptography applications tend to have a curse of requiring a lot of work after the security context, permission set, and other requirements are frozen in stone, which means that when the requirements change you have to start over and throw out a bunch of work (epistemic status: why do you think so many defi projects have more whitepapers than users?). The provably unpickable lock has 2 to 10x that problem: get the granularity wrong in step one, and most of your mechlib implementation won’t be salvageable. As the language model iterates on the spec L in step 5, the other language model has to iterate on the proof in step 6, because the new spec will break most of the proof.
Sorry I don’t know any mechanical engineering, Ben, otherwise I’d take some cracks at it. The idea of a logic such that its denotation is a bunch of mechanical engineering primitives seems interesting enough that my “if it was easy to do in less than a year someone would’ve, therefore there must be a moat” heuristic is tingling. Perhaps oddly, the quantum semantics folks (or with HoTT!) seem to have been productive, but I don’t know how much of that is translatable to mechanical engineering.
Reinforcement learning from proof assistant feedback, and yet more monte carlo tree search
The steps are pretraining, supervised finetuning, RLPAF (reinforcement learning from proof assistant feedback), and MCTS (monte carlo tree search). RLPAF is not very rich: it’s a zero reward for any bug at all and a one for a happy typechecker. Glad they got that far with just that.
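As a sketch of how sparse that signal is (my paraphrase, not DeepSeek’s code; `check_with_lean` is a hypothetical stand-in for whatever proof-checking harness they use):

```python
# Binary reward in the spirit of RLPAF: the only training signal is whether the
# proof assistant accepts the candidate proof.

def rlpaf_reward(candidate_proof: str, check_with_lean) -> float:
    """check_with_lean is a hypothetical callable returning an object with an .ok flag."""
    result = check_with_lean(candidate_proof)
    return 1.0 if result.ok else 0.0
```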
You can use the model at deepseek.com.
Harmonic ships their migration of miniF2F to Lean 4, gets 90% on it, is hiring
From their “one month in” newsletter. “Aristotle”, which has a mysterious methodology since I’ve only seen their marketing copy rather than an arxiv paper, gets 90% on miniF2F in Lean 4 when prompted with natural language proofs. It doesn’t look to me like the deepseek or LEGO papers do that? I could be wrong. It’s impressive just to autoformalize natural language proofs; I guess I’m still wrapping my head around how much harder it is (for an LLM) to come up with the proof as well.
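To illustrate the gap (my own toy example, not an actual miniF2F item): autoformalization turns a natural-language statement and proof like “the product of two odd integers is odd, because (2a+1)(2b+1) = 2(2ab+a+b)+1” into something Lean will check, and a prover that starts from the statement alone still has to find the witness and the algebra itself.

```lean
import Mathlib

-- Toy illustration of an autoformalization target; not taken from miniF2F.
theorem odd_mul_odd (m n : ℤ) (hm : Odd m) (hn : Odd n) : Odd (m * n) := by
  obtain ⟨a, ha⟩ := hm  -- m = 2 * a + 1
  obtain ⟨b, hb⟩ := hn  -- n = 2 * b + 1
  exact ⟨2 * a * b + a + b, by rw [ha, hb]; ring⟩
```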
Jobs: research engineer and software engineer
Atlas ships their big google doc alluded to in the last newsletter
Worth a read! The GSAI stack is large and varied, and this maps out the different sub-sub-disciplines. From the executive summary:
You could start whole organizations for every row in this table, and I wouldn’t be a part of any org that targets more than a few at once for fear of being unfocused. See the doc for more navigation (see what I did there? Navigating like with an atlas, perhaps? Get it?) of the field’s opportunities.[1]
Efficient shield synthesis via state-space transformation
Shielding is an area of reactive systems and reinforcement learning in which you mark states as unsafe and synthesize a kind of guarding layer between the agent and the environment that prevents unsafe actions from being executed in the environment. In the rejection-sampling-flavored version, the shield literally intercepts the unsafe action and tells the agent “we’re not running that, try another action”. One of the limitations in this literature is computational cost: shields are, like environments, state machines plus some frills, and there may simply be too many states. This is the limitation that this paper focuses on.
Besides cost, demanding a lot of domain knowledge is another limitation of shields, so this is an especially welcome development.
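For intuition, here’s a minimal sketch of the rejection-sampling flavor described above (my own toy code, not the paper’s synthesis algorithm; the agent/env method names are invented):

```python
# Toy shield: it owns a model of the transition function and a set of states
# marked unsafe, and refuses to forward actions that would land in them.

class Shield:
    def __init__(self, transition, unsafe_states):
        self.transition = transition        # (state, action) -> predicted next state
        self.unsafe_states = unsafe_states  # states marked unsafe

    def permits(self, state, action):
        return self.transition(state, action) not in self.unsafe_states


def shielded_step(env, agent, shield, state, max_tries=100):
    """Keep asking the agent for actions until the shield accepts one, then run it."""
    for _ in range(max_tries):
        action = agent.propose(state)
        if shield.permits(state, action):
            return env.step(action)           # only shield-approved actions reach the env
        agent.notify_rejected(state, action)  # "we're not running that, try another action"
    raise RuntimeError("agent proposed no shield-approved action")
```

The cost limitation shows up in `transition` and `unsafe_states`: for realistic systems that state machine is enormous, which is what the state-space transformation is trying to tame.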
Funding opportunities
ARIA jumped right to technical area three (TA3), prototyping the gatekeeper. Deadline October 2nd. Seems geared toward cyber-physical systems folks. In the document:
Note that verified software systems is an area which is highly suitable for a simplified gatekeeper workflow, in which the world-model is implicit in the specification logic. However, in the context of ARIA’s mission to “change the perception of what’s possible or valuable,” we consider that this application pathway is already perceived to be possible and valuable by the AI community. As such, this programme focuses on building capabilities to construct guaranteed-safe AI systems in cyber-physical domains. That being said, if you are an organisation which specialises in verified software, we would love to hear from you outside of this solicitation about the cyber-physical challenges that are just at the edge of the possible for your current techniques.
This is really cool stuff, I hope they find brave and adventurous teams. I had thought gatekeeper prototypes would be in minecraft or mujoco (and asked a funder if they’d support me in doing that), so it’s wild to see them going for actual cyberphysical systems so quickly.
Paper club
Add to your calendar. On September 19th we will read a paper about assume-guarantee contracts with learned components. I’m liable to have made a summary slide deck to kick us off, but if I don’t, we’ll quietly read together for the first 20-30 minutes then discuss. We’ll use the Google Meet room in the gcal event by default.
Andrew Dickson’s excellent post
See Limitations on Formal Verification for AI Safety over on LessWrong. I have a lot of agreements, and my disagreements are more a matter of what deserves emphasis than of fundamentals. Overall, I think the Tegmark/Omohundro paper failed to convey a swiss-cheesey worldview, and sounded too much like “why not just capture alignment properties in ‘specs’ and prove the software ‘correct’?” (i.e. the vibe I was responding to in my very pithy post). However, the main reason I’m not using Dickson’s post as a reason to pivot my whole worldview and resulting research is captured in one of Steve’s comments:
I’m focused on making sure our infrastructure is safe against AI attacks.
Like, a very strong version I almost endorse is “GSAI isn’t about AI at all, it’s about systems coded by extremely powerful developers (which happen to be AIs)”, and about ensuring that safety, security, and reliability capabilities scale at similar speeds to other kinds of capabilities.
It looks like one can satisfy Dickson just by assuring him that GSAI is a part of a swiss cheese stack, and that no one is messianically promoting One Weird Trick To Solve Alignment. Of course, I do hope that no one is messianically promoting One Weird Trick…
[1] One problem off the top of my head regarding the InterFramework section: Coq and Lean seem the most conceptually straightforward pair since they have the same underlying calculus, but even there just a little impredicativity or coinduction could lead to extreme headaches. Now, you could have a model at some point in the future that steamrolls over these headaches, but then you have a social problem of the broader Lean community not wanting to upstream those changes; various forks diverging fundamentally seems problematic to me, and would lead to a lot of duplicated work and missed opportunities for collaboration. I plan to prompt Opus 3.5 with “replicate flocq in lean4” as soon as I get access to the model, but how much more prompting effort will it take to ensure compliance with preexisting abstractions and design patterns, so that the result can not only serve my purposes but be accepted by the community? At least there’s no coinduction in flocq, though some of the proofs may rely on set impredicativity for all I know (I haven’t looked at it in a while).
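A concrete flavor of the mismatch, as a sketch (my illustration, not an example from the Atlas doc):

```lean
-- Prop is impredicative in Lean 4 just as in Coq, so quantifying over all
-- propositions inside a Prop ports directly:
def PointwiseId : Prop := ∀ p : Prop, p → p

-- Coq's `CoInductive` types and `-impredicative-set` developments, by contrast,
-- have no built-in Lean 4 counterpart; a port has to re-encode them (e.g. streams
-- as functions out of Nat), and that re-encoding is exactly where a model's
-- choices can drift from the abstractions the community would accept upstream.
def StreamAsFun (α : Type) : Type := Nat → α
```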
I do think it’s important to reach appropriately high safety assurances before developing or deploying future AI systems which would be capable of causing a catastrophe. However, I believe that the path there is to extend and complement current techniques, including empirical and experimental approaches alongside formal verification—whatever actually works in practice.
for what it’s worth, I do see GSAI as largely a swiss cheesey worldview. Though I can see how you might read some of the authors involved to be implying otherwise! I should knock out a post on this.
not just get it to sycophantically agree with you
i struggle with this, and need to attend a prompting bootcamp
i’m getting back into composing and arranging. send me rat poems to set to music!
sure—i agree; that’s why i said “something adjacent to”, because it had enough overlap in properties. I think my comment completely stands with a different word choice, I’m just not sure what word choice would do a better job.
I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I’d prefer to avoid lock-in to my or anyone’s values at t=2024, but such a lock-in might be “good enough” if I’m threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I’d accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different timesteps).
In other words, I can’t beat the old fragility of value stuff (but I haven’t tried in a while).
I wrote down my full thoughts on good successor criteria in 2021 https://www.lesswrong.com/posts/c4B45PGxCgY7CEMXr/what-am-i-fighting-for
AI welfare: matters, but when I started reading lesswrong I literally thought that disenfranchising them from the definition of prosperity was equivalent to subjecting them to suffering, and I don’t think this anymore.
Thinking about a top-level post on FOMO and research taste
Fear of missing out defined as inability to execute on a project cuz there’s a cooler project if you pivot
but it also gestures at more of a strict negative, where you think your project sucks before you finish it, so you never execute
was discussing this with a friend: “yeah I mean lesswrong is pretty egregious cuz it sorta promotes this idea of research taste as the ability to tear things down, which can be done armchair”
I’ve developed strategies to beat this FOMO and gain more depth and detail with projects (too recent to see returns yet, but getting there) but I also suspect it was nutritious of me to develop discernment about what projects are valuable or not valuable for various threat models and theories of change (in such a way that being a phd student off of lesswrong wouldn’t have been as good in crucial ways, tho way better in other ways).
but I think the point is you have to turn off this discernment sometimes, unless you want to specialize in telling people why their plans won’t work, which I’m more dubious on the value of than I used to be
Idk maybe this shortform is most of the value of the top level post
He had become so caught up in building sentences that he had almost forgotten the barbaric days when thinking was like a splash of color landing on a page.
A B-valued quantifier is any function (X → B) → B, so when B is Bool, quantifiers are the functions that take predicates as input and return Bool as output (same for Prop). The standard max and min functions on arrays count as real-valued quantifiers for some index set I. I thought I had seen forall as the max of the Prop-valued quantifiers and exists as the min somewhere, which has a nice mindfeel since forall has this “big” feeling: if you determined ∀ x : X, P x (where the binder is just syntax sugar for applying the quantifier to P, since the variable name is irrelevant) by exhaustive checking, it would cost O(|X|), whereas checking ∃ x : X, P x against a supplied witness would cost O(1), unless the derivation of the witness was somehow dependent on the size of the domain.
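A minimal Lean 4 rendering of the Bool-valued case (my names; only `List.all` / `List.any` are from core):

```lean
-- A B-valued quantifier consumes a predicate on X and returns a B.
def Quantifier (X B : Type) : Type := (X → B) → B

-- Over a finite index set given as a list, the Bool-valued forall and exists are
-- just `all` and `any`, which is the exhaustive-checking cost story above.
def forallB {X : Type} (xs : List X) : Quantifier X Bool := fun p => xs.all p
def existsB {X : Type} (xs : List X) : Quantifier X Bool := fun p => xs.any p

-- The Prop-valued versions are the usual binders.
def forallP (X : Type) : Quantifier X Prop := fun p => ∀ x, p x
def existsP (X : Type) : Quantifier X Prop := fun p => ∃ x, p x
```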
Incidentally, however, in differentiable logic it seems forall is the “minimal expectation” and exists is the “maximal expectation”. See page 10 of the LDL paper, where a universally quantified formula is interpreted as the limit, as γ goes to zero, of an expectation taken over a γ-ball about the min of the body rather than over its entire domain. So in this sense the interpretation of a universally quantified prop is a minimal expectation, and dually the interpretation of an existentially quantified prop is a maximal expectation.
I didn’t like the way this felt aesthetically, since, as I said, forall feels “big”, which mood-affiliates toward a max. But that’s notes from barely-remembered category theory I saw once. Anyway, I asked a language model and it said that forall is minimal because it imposes the strictest or “most conservative” requirement, so “max” in the sense of “exists is interpreted as maximal expectation” refers to maximal freedom.
I suppose this is fine.
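For what it’s worth, the crudest finite-sample rendering of that orientation (plain min/max over sampled truth values, Gödel-style; this is not the LDL expectation construction) looks like:

```python
import numpy as np

# Soft quantifier semantics over a batch of sampled points: forall as a min of the
# body's truth values, exists as a max. LDL replaces these with limits of
# expectations, but the min/max orientation is the same.

def forall_soft(truth_values: np.ndarray) -> float:
    return float(np.min(truth_values))

def exists_soft(truth_values: np.ndarray) -> float:
    return float(np.max(truth_values))
```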
I’m excited for language model interpretability to teach us about the difference between compilers and simulations of compilers. In the sense that chatgpt and I can both predict what a compiler of a suitably popular programming language will do on some input, what’s going on there? Surely we’re not reimplementing the compiler on our substrate, even in the limit of perfect prediction? This will be an opportunity for a programming language theorist in another year or two of interp progress.
firefox is giving me a currently broken redirect on https://atlascomputing.org/safeguarded-ai-summary.pdf
Does anyone use a vim-style / mouse-minimal browser? I like Tridactyl better than the other one I tried, but it’s not great when there’s a vim mode inside the browser window: everything starts to step on each other (like in jupyter, colab, leetcode, codesignal).
claude and chatgpt are pretty good at ingesting textbooks and papers and making org-drill cards.
here’s my system prompt https://chat.openai.com/g/g-rgeaNP1lO-org-drill-card-creator though i usually tune it a little further per session.
Here are takes on the idea from the anki ecosystem
I tried a little ankigpt and it was fine, i haven’t tried the direct plugin from ankiweb. I’m opting for org-drill here cuz I really like plaintext.
did anyone draw up an estimate of how much the proportion of code written by LLMs will increase? or even what the proportion is today