Quinn
I would encourage a taboo on “independently wealthy”; I think it’s vague and obscurantist, and doesn’t actually capture real-life runway considerations. “How long can I sustain which burn rate, and which burn rate works with my lifestyle?” is the actual question!
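As a minimal illustration of that framing (the savings figure and burn rates below are placeholders, not advice):

```python
# Minimal runway arithmetic behind "how long can I sustain which burn rate?".
# The numbers here are placeholders, not recommendations.
savings = 120_000          # liquid savings, in dollars
burn_rates = {
    "frugal": 2_500,       # monthly spend for a stripped-down lifestyle
    "comfortable": 5_000,  # monthly spend for the lifestyle you actually want
}

for lifestyle, monthly_burn in burn_rates.items():
    months = savings / monthly_burn
    print(f"{lifestyle}: {months:.0f} months of runway")
```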
There’s a remarkable TNG episode about enfeeblement and Paul-based threatmodels, if I recall correctly.
There’s a post-scarcity planet with some sort of Engine of Prosperity in the town square, and it doesn’t require maintenance for enough generations that engineering itself becomes a lost oral tradition. Then it starts showing signs of wear and tear...
If Paul were writing this story, they would die. I think in the actual episode, there’s a disagreeable autistic teenager who expresses curiosity about the Engine mechanisms, and the grownups basically shame him, like “shut up and focus on painting and dancing”. I think the Enterprise crew bails them out by fixing the Engine and leaving the kid with a lesson about recultivating engineering as a discipline and a sort of intergenerational cultural heritage and responsibility.
I probably saw it over 10 years ago, and I haven’t looked it up yet. Man, this is a massive boon to the science-communication elements of threatmodeling, given that the state of public discussion seems to offer little middle ground between unemployment and literally everyone literally dying. We can just point people to this episode! Any thoughts?
Mad respect for the post. Disagree with your background free speech / society philosophy, re the protestors:
Magnanimously and enthusiastically embracing the distribution of views and tactics does not entail withholding criticism. It sorta reminds me of the mistake of forecasting conditional on your own inaction, forgetting that at every time step you (like other agents) will be there responding to new information and propagating your beliefs and adjusting your policies. You’re a member of the public, too! You can’t just recuse yourself.
> I also think democracy hinges crucially on free speech, and I think the world will function better if people don’t feel shut-down or clammed-up by people-like-me saying “the remainder of May 2023 probably isn’t a great time for AI protests.”
This sentence is bizarre! People who take notice and want to help, feeling the protest impulse in their heart, deserve peer review, in principle, period. Obviously there are tactical or strategic complications, like discursive aesthetics or information diets / priors and other sources of inferential distance, but the principle is still true!
As a wee lad, I was compelled more by relative status/wealth than by absolute status/wealth. It simply was not salient to me that a bad Gini score could in principle be paired with negligible rates of suffering! A healthy diet of Our World in Data, LessWrong, and the EA Forum set me straight, eventually; but we have to work with the actually existing distribution of ideologies and information diets.
I think people who reject moral circle expansion to the future (in the righteous sense: the idea that only oppressors would undervalue more obvious problems) are actually way more focused on this crux (relative vs absolute) than on the content of their population ethics opinions.
Ideally there would be an exceedingly high bar for strategic withholding of worldviews. I’d love some mechanism for sending downvotes to the orgs that vetoed their staff from participating! I’d love some way of socially pressuring these orgs into at least trying to convince us that they had really good reasons.
I’m pretty cynical: I assume nervous and uncalibrated shuffling by HR or legal counsel is more likely than actual defense against hazardous leakage of, say, capabilities hints.
There exists a school of thought that looks roughly like “the dating docs people are stupid; top-down design of selection effects can obviously never work; plugging the monkey brain into the meat market is a better strategy for arbitrary dating goals”. I’ve become more sympathetic to it, but I’m still deeply resistant to adopting it completely.
I’m not really being scientific; I’m much more interested in “leading by example” and paving the way for the kind of dating culture I think is superior relative to my values than I am in updating my beliefs against the unfeeling stone of Reality.
To be fair, someone and I tried a “select on building properties, try to cultivate entertaining properties later” strategy.
We read each other’s dating docs, calculated an optimism-inducing amount of goal alignment and compatibility, and took a crack at being charming and funny and sexy to each other.
It did not go as planned. I was a little shocked—surely my monkeybrain needs would cooperate with (or be coerced into aligning with) my actual life goals, right? With hindsight I’m kind of honing my ability to recognize “just reverse engineer the ‘spark’, how hard can it be” as a special kind of stupid.
I get the vibe that Conjecture doesn’t have forecasting staff, or a sense of iterating on beliefs about the future to update strategy. I sorta get a vibe that Conjecture is just gonna stick with their timelines until New Year’s Day 2028, and if we’re not all dead, write a new strategy based on a new forecast. Is this accurate?
I think there’s a kind of division of labor going on, and I’m going to use a software industry metaphor. If you’re redteaming, auditing, or QAing at a large org, you should really be maxing out on paranoia, being glass-half-empty, etc., because you believe that elsewhere in the institution, other people’s jobs are to consider your advice and weigh it against the risk tolerance implied by the budget or by regulation or whatever. Whereas if you’re redteaming, auditing, or QAing at a small org, you take on some of the responsibility of measuring threatmodels against given cost constraints. It’s not exactly obvious that someone else in the org will rationally integrate the information you provide into the organization’s strategy and implementation; you want them to follow your recommendations in a way that makes business sense.
My guess is that being a downer comes from this large-org register, where the redteam’s job description is literally just redteaming, whereas it might make sense for other researchers or communicators to take a more small-org approach, where the redteam is probably multitasking in some way.
Intuition pump: I don’t have a citation, but I once saw a remark that the commercial airline crash rate in the late Soviet Union was plausibly more rational than the commercial airline crash rate in the US. Airplane risk intolerance in the US is great for QA jobs, but that doesn’t mean it’s based on an optimal tradeoff between price and safety with respect to stakeholder preferences (if you could elicit them in some way). Economists make related remarks re nuclear energy.
I think a Coq bug, or the type of mistake that’s easy to make in formalization, is more likely than a transistor failure or a bad bit flip bubbling up through the stack, but that’s still not what I’m talking about.
In the background, I’m thinking a little about this Terry Tao post, about how the process of grokking the proof (and being graceful with typos, knowing which typos are bad and which are harmless, and so on) is where the state-of-the-art mathematics lies, not in the proof itself.
I was discussing your comment with a friend, who suggested what I’m calling factored cognition parallel auditing (FCPA): she asked why we don’t just divide the 1e6 lines of Coq into 1000-line segments and send them out to 1000 experts, each of whom makes sense of their segment. The realities would be a little messier than linearly stitching together segments, but this basic setup (or something that emphasizes recursion, like a binary-search-flavored version; see the sketch below) would, I think, buy us roughly 2-3x the assurance that RH is true over just blindly trusting the type theory / laptop to execute `Riemann.v` in the Coq session.
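A minimal sketch of the flat version of that chunking scheme; every name here is an illustrative placeholder, not an existing tool:

```python
# Toy sketch of factored cognition parallel auditing (FCPA): split a large Coq
# development into fixed-size segments and hand each segment to one auditor.
# All names below are made up for illustration.
from pathlib import Path

SEGMENT_SIZE = 1000  # lines per auditor, as in the 1e6-lines / 1000-experts example


def split_into_segments(proof_path: str, segment_size: int = SEGMENT_SIZE) -> list[str]:
    """Chunk a proof script into contiguous blocks of at most `segment_size` lines."""
    lines = Path(proof_path).read_text().splitlines()
    return ["\n".join(lines[i:i + segment_size]) for i in range(0, len(lines), segment_size)]


def assign_auditors(segments: list[str], auditors: list[str]) -> dict[str, str]:
    """One auditor per segment -- the idealized, perfectly parallel case."""
    assert len(auditors) >= len(segments), "need at least one expert per segment"
    return dict(zip(auditors, segments))


# The recursive / binary-search flavor would instead split the file in half,
# audit the interface between the halves, and recurse into each half, so no
# single auditor needs global context.
```

In practice the segments share definitions and lemmas, so each auditor would also need at least the statements (not the proofs) from earlier segments, which is part of the messiness mentioned above.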
In the current comment, let’s assume a background of Russell-like “enfeeblement” or Christiano-like “losing control of the world/future in a slow burn”, filtering out the more Yudkowsky-like threatmodels. Not to completely discount malicious injections in `Riemann.v`, but they don’t seem productive to emphasize. I will invoke the vague notions of “theorem intricacy” and “proof length”, but it should be possible to not read this footnote[1] and still follow along with the goals of the current comment.

This isn’t really about RH; RH isn’t necessarily important, etc. I was actually alluding to / vaguely criticizing a few different ideas at once in the paragraph you quoted:
1. The proof assistants or applied type theory technologies will help agent foundations researchers progress and/or teach. I’m bearish on this, largely because I think agent foundations researchers care more about the aforementioned Tao blogpost than about the Kevin Buzzard side of the aisle (recall that Kevin Buzzard is the guy leading the charge to put all math department operations on Lean, from undergrad coursework to the state of the art). I don’t think FCPA would help with Tao’s notion of mathematical sophistication and understanding, barring pretty drastic advances in HCI/BCI. If theorem intricacy and proof length escape us in agent foundations, formalization doesn’t solve many of the problems that this would present.
2. Alignment itself turns out to be deeply inelegant. In these worlds, we may be using a flavor of codegen to even write down the theorem/spec, not to mention the proofs. Let’s assume type theory is secure (no leak- or unbox-related attack surface) and trustworthy (no false positives or false negatives). Then I don’t know how to beat FCPA at arbitrary theorem intricacy and proof size. Maybe a massive FCPA regime could “understand alignment” in an “institutions as distributed cognition” sense. I find this unsatisfying, but it may be the best we can do.
3. The LLM does alignment research for us and communicates it to us via type theory (there are flavors of this approach that do not invoke type theory, and I think those are even worse). Even when we assume type theory is secure and trustworthy, we have the same problems I highlighted in the preceding list entries. But SGD and the programs it writes are untrusted, so the standard for grokking things the old-fashioned way should be really high. Trusting an auditing process like FCPA to be adequate here feels really dicey to me.
I think we should be sensitive to “which parts of the proof are outsourced to SGD or the programs it writes?” as a potential area of research acceleration. I should be willing to be convinced that parts of the proof are harmless to automate or not pay close attention to, deferring to the type theory or some proof automation regime that may even involve SGD in some way.
If we’re relying on computational processes we don’t understand, we still get enfeeblement and we still risk losing control of the world/future over time even if type theory is a bajillion times more secure and trustworthy than SGD and the programs it writes.
I hope that makes my worldview more crisp.
[1] There’s not a widely agreed-upon metric or proxy for theorem size; I’m intentionally using a wishy-washy word like “intricacy” to keep the reader from taking any given metric or proxy too seriously. I think the minimum viable quantitative view is basically AST size, and it’s ok to be unsatisfied with this. Proof length is sort of similar: a proof object is kind of like a lambda term, and its size can be thought of as tree size. Neither of these is “basis independent”; each measurement will be sensitive to variation in syntactic choices, but that probably doesn’t matter. There’s also a kind of Kolmogorov complexity point of view, which I think usually prefers to emphasize the character count of programs over anything to do with the tree perspective. A particular kind of constructivist would say that the theorem intricacy of a statement is the length of the shortest proof of that statement, but I think there are benefits to separating statement from proof; that would take me too far afield. A Turing machine story would emphasize different things but not be qualitatively different.
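To make those two proxies concrete, here is a toy illustration; the miniature term language is made up purely for this example, and neither measure is privileged:

```python
# Toy illustration of two "intricacy" proxies: AST/tree size vs. character
# count of the printed term. The tiny term language below is invented for
# illustration only; both measures depend on syntactic choices.
from dataclasses import dataclass


@dataclass
class Var:
    name: str


@dataclass
class Lam:
    param: str
    body: object


@dataclass
class App:
    fn: object
    arg: object


def tree_size(term) -> int:
    """Node count of the term: one candidate proxy for proof length."""
    if isinstance(term, Var):
        return 1
    if isinstance(term, Lam):
        return 1 + tree_size(term.body)
    if isinstance(term, App):
        return 1 + tree_size(term.fn) + tree_size(term.arg)
    raise TypeError(term)


def pretty(term) -> str:
    """Printed form; len(pretty(t)) is the Kolmogorov-flavored proxy."""
    if isinstance(term, Var):
        return term.name
    if isinstance(term, Lam):
        return f"(fun {term.param} => {pretty(term.body)})"
    return f"({pretty(term.fn)} {pretty(term.arg)})"


# The identity function applied to itself: (fun x => x) (fun y => y)
example = App(Lam("x", Var("x")), Lam("y", Var("y")))
print(tree_size(example))    # 5 nodes, regardless of how we print it
print(len(pretty(example)))  # character count shifts with syntactic choices
```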
I like your strategy sketch.
There’s the guard literature in RL theory, which might be nice for step one.
I agree with step two, favoring non-learned components over learned components whenever possible. I deeply appreciate thinking of learned components as an attack surface; shrink it when you can.
Really excited about step three. Estimation and evaluation and forecasting are deeply underrated. It’s weird to say, because relative to the rest of civilization the movement overrates forecasting, and maybe the rest of the world will be vindicated and forecasting will go the way of Esperanto, but I don’t think we should think in relative terms. Anyway, I feel like if you had the similarity metric and control that you want for this, it’d be easier to just say “don’t do anything drastic” in a blanket way (of course, competitive pressures to be smarter and bolder than anyone else would ruin this anyway).
Step four is more of an implementation reality than a part of the high-level strategy, but yeah.
Thanks a ton for the comment, it’s exactly the type of thing I wanted my post to generate.
Agree. The product stack has non-learned components in it. Let’s throw the kitchen sink at those components.
Some techniques that go under the banner of formal verification are not applied type theory (deduction); they’re usually in the model checking family, or maybe abstract interpretation. For the record, I only have applied type theory experience: I’m a Coq dev, and I’ve spent a cumulative two hours or so over the past three years pondering abstract interpretation or model checking. That’s why I said “formal proof” in the title instead of “formal verification”. Maybe I could’ve called it “proof engineering” or “applied type theory” in the title, but I didn’t.
In order for there to be alpha in applied type theory for the learned components, a lot of things need to fall into place: mostly in mechinterp, then some moderate applied type theory work, then a whole lot of developer experience / developer ecosystems work. That’s what I believe.
Most codegen techniques assume a margin of error until proven otherwise; that includes extraction from Coq to OCaml or Haskell, and lots of program synthesis techniques I’ve heard about. Extraction (like that from Coq to OCaml or Haskell) is probably the furthest you can get from SGD on “the codegen spectrum”.
Sorry for abbreviating. Mechinterp in my post refers to the entire field of mechanistic interpretability (the statistically learned program’s analogue of reverse engineering a `.c` file from an executable produced by a C compiler), not to a specific post.
Intuitively, asking the LLM to translate from English to Coq is a cousin of “ask the LLM to summarize the Coq into English”, via security mindset. What assurances can we get about the translation process? This seems dicey and hard. I’m imagining that there’s an auditing game design out there that would give us some hope, though.
Executable textbooks are fun but the odds that playing one cultivates the parts of alignment research that are hard or important seem slim to me.
Crucially to your comment, though: what work is “implicitly” doing? “Implicitly” is where most preparadigmatic work happens! This is loosely related to the false sense of security. Formalization starts when you admit your assumptions as premises. Formal methods do not second-guess assumptions.
If I recall correctly the Cardano team’s paper on agile formal methods may have had some comments on this, but I’m not 100% sure.
We need a cool one-word snappy thing to say for “just what do you think you know and how do you think you know it” or like “I’m requesting more background about this belief you’ve stated, if you have time”.
I want something that has the same mouthfeel as “roll to disbelieve” for this.
I don’t have an overarching theory of the Hard Problem of Jargon, but I have some guesses about the sorts of mistakes people love to make. My overarching point is just “things are hard”.
This is a deeply rare phenomenon. I do think there are nonzero places with a peculiar mix of prestige and thinness of kayfabe that lead to this actually happening (like if you’re maintaining a polite fiction of meritocracy in the face of aggressive nepotism, you might rely on cheap superiority signals to nudge people into not calling BS). In a different way, I remember that when I worked at Home Depot, supervisors may have been protecting their $2/hr pay bump by hiding their responsibilities from their subordinates (to prevent subordinates from figuring out that they could handle actual supervising if the hierarchy was disturbed). Generalizing from these scenarios to scientific disciplines is perfectly silly! Most people, and an even vaster majority in the sciences, are extremely excited about thinking clearly and communicating clearly to as many people as possible!
I also want to point out a distinction you may be missing in anti-finance populism. A synthetic CDO is sketchy because it is needlessly complex by its nature, not because the communication strategy was insufficiently optimized! But you wrote about “unnecessary jargon”, implying that you think implementing and reasoning about synthetic CDOs is inherently easy, and that finance workers are misleading people into thinking it’s hard (because of their scarcity mindset, to protect their job security, etc.). Jargon is an incredibly weak way to implement anti-finance populism; a stronger form of it says that the instruments and processes themselves are overcomplicated (for shady reasons or whatever).
Moreover, an emphasis on jargon complaints implies a destructive worldview. The various degrees and flavors of “there are no hard open problems; people say there are hard open problems to protect their power; my friends and I have all the answers, which were surprisingly easy to find, and we’ll prove it to you as soon as you give us power” dynamics I’ve watched over the years seem tightly related, to me.
I do get frustrated when people tell me that “clear writing” is one thing that definitely exists, because I think they’re ignoring tradeoffs. Questions like “how many predictable objections should I address: 3? 6? does the ‘clear writing’ protocol tell me to roll a d6?” get ignored. To be fair, Arbital was initially developed to be “wikipedia with difficulty levels”, which would’ve made this easier.
TLDR
I think the way people should reason about facing down jargon is to first ask “can I help them improve?”, and if you can’t, then ask “have they earned my attention?”. Literally everywhere in the world, in every discipline, there are separate questions for communication at the state of the art and communication with the public. People calculate which fields they want to learn in detail, because effort is scarce. Saying “it’s a problem that learning your field takes effort” makes zero sense.