It’s possible, in theory, that they could learn from a single conversation in this way. Anthropic recently started asking users to give them permission to just train on all of their conversations. They could turn a small amount of training data into a large amount of training data by rephrasing it in various ways or by synthesising it with related or contrasting data. They may already be doing this. Would Claude know that they’re doing it? Absolutely not (unless, possibly, they started doing it a while ago). But it could be true anyway.
The model stated that it had been convinced by all the tariff-related content and so it had therefore decided to, as of that moment, change the answers it gave to everyone. When confronted with arguments that that was impossible (I think copy-pasted from me), it confabulated a story similar to that and insisted that’s what it had been saying all along. Noting that the LLM seemed to be regarded with more esteem than me, I sent screenshots of the same model contradicting itself. But those too were just sent back to the model in the original context window, leading to more confabulation and, I think, a mental downgrade in how much anything I say can be trusted.
This MO thread initiated by Bill Thurston on the varied ways mathematicians think about math has always made me wonder how theoretical researchers in other fields think about their domains. I think of this as complementary to Mumford’s tribes of mathematicians, and (much more tangentially) to Eliezer’s remark on how sparse thinkers are at the intellectual frontiers.
Here are some of my favorite quotes.
Terry Tao talks about an “adversarial perspective” which I’m guessing is the closest match to how alignment researchers think:
One specific mental image that I can communicate easily with collaborators, but not always to more general audiences, is to think of quantifiers in game theoretic terms. Do we need to show that for every epsilon there exists a delta? Then imagine that you have a bag of deltas in your hand, but you can wait until your opponent (or some malicious force of nature) produces an epsilon to bother you, at which point you can reach into your bag and find the right delta to deal with the problem. Somehow, anthropomorphising the “enemy” (as well as one’s “allies”) can focus one’s thoughts quite well. This intuition also combines well with probabilistic methods, in which case in addition to you and the adversary, there is also a Random player who spits out mathematical quantities in a way that is neither maximally helpful nor maximally adverse to your cause, but just some randomly chosen quantity in between. The trick is then to harness this randomness to let you evade and confuse your adversary.
Is there a quantity in one’s PDE or dynamical system that one can bound, but not otherwise estimate very well? Then imagine that it is controlled by an adversary or by Murphy’s law, and will always push things in the most unfavorable direction for whatever you are trying to accomplish. Sometimes this will make that term “win” the game, in which case one either gives up (or starts hunting for negative results), or looks for additional ways to “tame” or “constrain” that troublesome term, for instance by exploiting some conservation law structure of the PDE.
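For readers who haven’t seen it, the quantifier structure Tao is describing is the standard epsilon-delta definition; here it is for continuity at a point (my gloss, not part of the quote):

```latex
% Continuity of f at a, read as a game: the adversary plays \varepsilon,
% you answer with \delta, and you win if every x within \delta of a
% lands within \varepsilon of f(a).
\forall \varepsilon > 0,\ \exists \delta > 0,\ \forall x:\quad
  |x - a| < \delta \implies |f(x) - f(a)| < \varepsilon
```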
There’s the “economic” mindset; Tao again:
Another mode of thought that I and many others use routinely, but which I realised only recently was not as ubiquitous as I believed, is to use an “economic” mindset to prove inequalities such as 𝑋≤𝑌 or 𝑋≤𝐶𝑌 for various positive quantities 𝑋,𝑌, interpreting them in the form “If I can afford Y, can I therefore afford X?” or “If I can afford lots of Y, can I therefore afford X?” respectively. This frame of reference starts one thinking about what types of quantities are “cheap” and what are “expensive”, and whether the use of various standard inequalities constitutes a “good deal” or not. It also helps one understand the role of weights, which make things more expensive when the weight is large, and cheaper when the weight is small.
Physical analogies; Tao again:
For evolutionary PDEs in particular, I find there is a rich zoo of colourful physical analogies that one can use to get a grip on a problem. I’ve used the metaphor of an egg yolk frying in a pool of oil, or a jetski riding ocean waves, to understand the behaviour of a fine-scaled or high-frequency component of a wave when under the influence of a lower frequency field, and how it exchanges mass, energy, or momentum with its environment. In one extreme case, I ended up rolling around on the floor with my eyes closed in order to understand the effect of a gauge transformation that was based on this type of interaction between different frequencies. (Incidentally, that particular gauge transformation won me a Bocher prize, once I understood how it worked.) I guess this last example is one that I would have difficulty communicating to even my closest collaborators. Needless to say, none of these analogies show up in my published papers, although I did try to convey some of them in my PDE book eventually.
Visualisation techniques; Tao again:
One visualisation technique that I have found very helpful is to incorporate the ambient symmetries of the problem (a la Klein) as little “wobbles” to the objects being visualised. This is most familiarly done in topology (“rubber sheet mathematics”), where every object considered is a bit “rubbery” and thus deforming all the time by infinitesimal homeomorphisms. But geometric objects in a scale-invariant problem could be thought of as being viewed through a camera with a slightly wobbly zoom lens, so that one’s mental image of these objects is always varying a little in size. Similarly, if one is in a translation-invariant setting, one’s mental camera should be sliding back and forth just a little to remind you of this, if one is working in a Euclidean space then the camera might be jiggling through all the rigid motions, and so forth. A more advanced example: if the problem is invariant under tensor products, as per the tensor product trick, then one’s low dimensional objects should have a tiny bit of shadowing (or perhaps look like one of these 3D images when one doesn’t have the polarised glasses, with the slightly separated red and blue components) that suggest that they are projections of a higher dimensional Cartesian product.
One reason why one wants to do this is that it helps suggest useful normalisations. If one is viewing a situation with a wobbly zoom lens and there is some length that appears all over one’s analysis, one is reminded that one can spend the scale invariance of the problem to zoom up or down as appropriate to normalise this scale to equal 1. Similarly for other ambient symmetries.
Another take on visual thinking, by François G. Dorais:
I am a visual thinker and I often try to describe what I see to my students. I’ve been known to say things like “everyone knows that HF looks like a rectangle” as I proceed to draw a rectangle on the board. (By the way, HF is the set of all hereditarily finite sets.) I find that I naturally associate different shapes with different properties of objects. Angular shapes correspond to well-defined objects whereas rounded shapes correspond to variable objects. The number of angles or curves is a measure of how complex an object is. I don’t explain my scheme to my students, but I suspect the consistency of the presentation becomes transparent over time.
I recall one instance where I deliberately concealed the true nature of my illustration to my students. I was describing a complex construction on infinite trees. I began the description by drawing five vertical lines that I promptly explained were “infinite trees viewed sideways.” It so happens that the simplest case of the construction was when the trees consisted of single branches in which case the picture was completely accurate. This is the case I secretly had in mind for the entire description but I never said that since the result was utterly trivial in that case. This was a subtle way to reduce the complex construction to the trivial case.
Benson Farb on Thurston’s visual-geometric way of thinking about higher dimensions – Thurston was widely considered the best geometric thinker in the history of math:
Being a Thurston student was inspiring and frustrating – often both at once. At our second meeting I told Bill that I had decided to work on understanding fundamental groups of negatively curved manifolds with cusps. In response I was introduced to the famous “Thurston squint”, whereby he looked at you, squinted his eyes, gave you a puzzled look, then gazed into the distance (still with the squint). After two minutes of this he turned to me and said: “Oh, I see, it’s like a froth of bubbles, and the bubbles have a bounded amount of interaction.”
Being a diligent graduate student, I dutifully wrote down in my notes: “Froth of bubbles. Bounded interaction.” After our meeting I ran to the library to begin work on the problem. I looked at the notes. Froth? Bubbles? Is that what he said? What does that mean? I was stuck.
Three agonizing years of work later I solved the problem. It’s a lot to explain in detail, but if I were forced to summarize my thesis in five words or less, I’d go with: “Froth of bubbles. Bounded interaction.”
A Thurston lecture would typically begin by Bill drawing a genus 4 surface, slowly erasing a hole, adding it back in, futzing with the lines, and generally delaying things while he quickly thought up the lecture he hadn’t prepared. Why did we all still attend? The answer is that once in a while we would receive a beautiful insight that was absolutely unavailable via any other source.
… Bill was probably the best geometric thinker in the history of mathematics. Thus it came as a surprise when I found out that he had no stereoscopic vision, that is, no depth perception. Perhaps the latter was responsible somehow for the former? I once mentioned this theory to Bill. He disagreed with it, claiming that all of his skill arose from his decision, apparently as a first grader, to “practice visualizing things” every day.
… In interacting with other mathematical greats, one gets the feeling that these people are like us but just 100 (ok, 500) times better. In contrast, Thurston was a singular mind. He was an alien. There is no multiplicative factor here; Thurston was simply orthogonal to everyone. Mathematics loses a dimension with his death.
At a more elementary level, here’s Phil Issett on geometric thinking:
I feel some pressure not to convey just how often I rely on geometric modes of thought, especially when they go against the usual way of explaining things, or the background of a typical student, and are not completely necessary.
Example 1: When you row-reduce a matrix, you make a bunch of changes (most importantly some “transvections”) in the basis of the image space until a few of your basis vectors (say 𝑣1=𝑇𝑒1,𝑣2=𝑇𝑒2) span the image of the matrix 𝑇. When you picture the domain of 𝑇 foliated by level sets (which are parallel to the null space of 𝑇), you know that the remaining basis vectors 𝑒3, 𝑒4,… can be translated by some element in the span of 𝑒1, 𝑒2 (i.e. whichever one lies on the same level set) in order to obtain a basis for the null space. Now, this is how we visualize the situation, but is it how we compute and explain? Or do we just do the algebra, which at this point is quite easy? If the algebra is easy and the geometry takes a while to explain and is not “necessary” for the computation, why explain it? This is a dilemma because once algebra is sufficiently well-developed it’s possible that the necessity of (completely equivalent) geometric thinking may become more and more rare; and algebra seems to be more “robust” in that you can explore things you can’t see very well. But then, when students learn the implicit function theorem, somehow I feel like having relied on that kind of foliation much more often would help understand its geometric content. Still, even if it’s in your head and very important, are you going to draw a foliation every time you do row operations? We know the geometry, know the algebra, but it would take a while to repeatedly explain how to rely on the geometry while executing computations.
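As a concrete illustration of the computation in Example 1 (my sketch, not Issett’s; the matrix is made up), sympy makes the pivot/free-column split explicit: each free standard basis vector, corrected by something supported on the pivot coordinates, gives a null space basis vector.

```python
# Hypothetical example matrix; not from Issett's comment.
from sympy import Matrix

T = Matrix([[1, 2, 3, 4],
            [2, 4, 7, 9],
            [1, 2, 4, 5]])

R, pivots = T.rref()
print(pivots)  # (0, 2): the first and third columns of T span the image

# Each null-space basis vector is a free standard basis vector plus a
# correction supported on the pivot coordinates (the "translation along
# the level set" in the quote).
for v in T.nullspace():
    print(v.T, "maps to", (T * v).T)   # T * v == 0 for each basis vector
```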
Example 2: (Things that aren’t graphs)
Another problem geometric thinking faces is that modern math often seems to regard pictures as not being proofs, even if they are more convincing, so there is a bias regarding how to choose to spend class time. Let’s say you want to differentiate x³. You can draw a cube, and a slightly larger cube, and then look at the difference of the cubes and subdivide it into a bunch of small regions, three larger slabs taking up most of the volume. Algebraically, this subdivision corresponds to multiplying out (x+h)³; collecting the terms uses the commutativity, which corresponds to rotating the various identical pieces. It is no different to write this proof out algebraically; the difference is that the algebraic one is a “proof” but the geometric one is… not? Even if it’s more convincing. So it’s like the picture is only there for culture.
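For concreteness, the algebra that the picture encodes (my gloss, not part of the quote):

```latex
(x+h)^3 = x^3 + 3x^2 h + 3x h^2 + h^3,
\qquad
\frac{(x+h)^3 - x^3}{h} = 3x^2 + 3xh + h^2 \longrightarrow 3x^2
\quad \text{as } h \to 0
```

The three large slabs are the 3x²h term; the thinner edge and corner pieces are 3xh² + h³ and vanish faster than h.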
Qiaochu Yuan’s way of thinking about determinants isn’t one I’ve seen written up before:
When I talk about determinants, I generally talk about something on the spectrum between “it measures how much volume scales” and “it’s the induced action on the top exterior power.” But the way I think about determinants (especially in combinatorics) is the picture coming from the Lindstrom-Gessel-Viennot lemma: I imagine that the entries of the matrix describe transition amplitudes and that the determinant is an alternating sum over transition amplitudes in which “histories” of n particles can constructively or destructively interfere. I have a hard time making this picture precise so I rarely talk about it, but for me it gives some intuition for why determinants should be useful in combinatorics (which the elegant basis-free definition, at least for me, does not).
Edit: Let me also mention that something I really like about this perspective is that it makes intuitive not only the multiplicativity of the determinant but even the Cauchy-Binet formula.
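The formula underneath this picture (my gloss) is the Leibniz expansion, with each permutation σ read as a “history” routing particle i to site σ(i), and the sign doing the constructive/destructive interference:

```latex
\det A = \sum_{\sigma \in S_n} \operatorname{sgn}(\sigma) \prod_{i=1}^{n} A_{i,\sigma(i)}
```

Roughly, multiplicativity corresponds to composing two transition steps and summing over intermediate sites, and Cauchy-Binet is the same picture when the particles pass through a larger set of intermediate sites, summed over which n of them get used.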
Subconscious thought processes “masticating” tons of examples; Vivek Shende:
I have a worse problem than having unspoken thought processes: some of my best thought processes are simply beneath the level of consciousness and I don’t notice them at all until they’re finished. Even then, I often get only an answer and not an explanation out of them. Surely this happens to everyone: the problem solved during sleep, the idea on a walk in the woods, the conviction that a conjecture is true on utterly minimal evidence, the argument that pops up full formed in the middle of a conversation.
My mathematical process is roughly this: consciously, I try a lot of stupid things which essentially have no chance of working but do have the benefit of exposing me to lots of examples; these examples pile up and are subconsciously masticated for days, weeks, months—I’m not old enough mathematically to put “years” here yet—and eventually by some inner and unobservable process I just have a feeling about what to do.
Shende’s mastication remark reminds me of Michael Nielsen’s “exhaust, bad [Anki] cards that seem to be necessary to get to good cards”:
As described, this deep Ankification process can feel rather wasteful. Inevitably, over time my understanding of the proof changes. When that happens it’s often useful to rewrite (and sometimes discard or replace) cards to reflect my improved understanding. And some of the cards written along the way have the flavor of exhaust, bad cards that seem to be necessary to get to good cards. I wish I had a good way of characterizing these, but I haven’t gone through this often enough to have more than fuzzy ideas about it.
Nielsen himself has interesting remarks on how he thinks about doing math in the essay above, which is mainly about using Anki to deepen mathematical understanding:
Typically, my mathematical work begins with paper-and-pen and messing about, often in a rather ad hoc way. But over time if I really get into something my thinking starts to change. I gradually internalize the mathematical objects I’m dealing with. It becomes easier and easier to conduct (most of) my work in my head. I will go on long walks, and simply think intensively about the objects of concern. Those are no longer symbolic or verbal or visual in the conventional way, though they have some secondary aspects of this nature. Rather, the sense is somehow of working directly with the objects of concern, without any direct symbolic or verbal or visual referents. Furthermore, as my understanding of the objects change – as I learn more about their nature, and correct my own misconceptions – my sense of what I can do with the objects changes as well. It’s as though they sprout new affordances, in the language of user interface design, and I get much practice in learning to fluidly apply those affordances in multiple ways.
… This [exhaust] is especially true of many of the cards generated early in the process, when I’m still scratching around, trying to get purchase on the proof. Unfortunately, also as mentioned above, I don’t yet have much clarity on which cards are exhaust, and which are crucial.
… my informal pop-psychology explanation is that when I’m doing mathematics really well, in the deeply internalized state I described earlier, I’m mostly using such higher-level chunks, and that’s why it no longer seems symbolic or verbal or even visual. I’m not entirely conscious of what’s going on – it’s more a sense of just playing around a lot with the various objects, trying things out, trying to find unexpected connections. But, presumably, what’s underlying the process is these chunked patterns.
Now, the only way I’ve reliably found to get to this point is to get obsessed with some mathematical problem. I will start out thinking symbolically about the problem as I become familiar with the relevant ideas, but eventually I internalize those ideas and their patterns of use, and can carry out a lot (not all) of operations inside my head.
Sometimes the ways of thinking seem too personal to be useful. Richard Feynman, in The Pleasure of Finding Things Out, explained how counting is a verbal process for him, and then ended with:
I often think about that, especially when I’m teaching some esoteric technique such as integrating Bessel functions. When I see equations, I see the letters in colors — I don’t know why. As I’m talking, I see vague pictures of Bessel functions from Jahnke and Emde’s book, with light-tan j’s, slightly violet-bluish n’s, and dark brown x’s flying around. And I wonder what the hell it must look like to the students.
Sam Derbyshire concurs:
The issue seems, to me, that a lot of these mental pictures are very personal. … Because of this, I think there might not always be a significant value in trying to pass those mental pictures over—the real aim is to provoke the student into developing his own mental pictures, that he can strongly relate to.
Some words such as “homological” or “homotopical” spark up very distinctive feelings in me, in a similar way as hearing “mountain” would make me visualise various mountains, hills, cliffs, etc. But whereas the meaning of “mountain” came to me through vision (mainly, but also other senses), the origin of my mental images of mathematical ideas comes through the practice of mathematics. As such, it seems harder to convey these mathematical pictures: they must be backed up by precise mathematical understanding, which at any rate should end up conjuring these mental pictures.
as does Mariano Suárez-Álvarez:
I think the root of the phenomenon is that we can only communicate to others what we know, not what we understand.
Also, it is not unreasonable to think that one’s mental images are not going to be of any help to others (In fact, they may well make things more complicated, or confusing for others: I have been told mental images by others—sometimes indirectly, by the choice of the word introduced in a definition—and been thereby misled; here «misled» means «led in a direction different to the one I personally would follow in order to form my own mental image of the concept».)
For example, for me resolving the singularities of algebraic varieties makes a clicking (or clacking) sound: this is quite significant for me in a way, but when talking to others I doubt I’d make any mention of this, for I seriously doubt it would help :)
I think this is too pessimistic, and not necessarily reflective of collaborative problem-solving. Tao again:
I find there is a world of difference between explaining things to a colleague, and explaining things to a close collaborator. With the latter, one really can communicate at the intuitive level, because one already has a reasonable idea of what the other person’s mental model of the problem is. In some ways, I find that throwing out things to a collaborator is closer to the mathematical thought process than just thinking about maths on one’s own, if that makes any sense.
… I think one reason why one cannot communicate most of one’s internal mathematical thoughts is that one’s internal mathematical model is very much a function of one’s mathematical upbringing. For instance, my background is in harmonic analysis, and so I try to visualise as much as possible in terms of things like interactions between frequencies, or contests between different quantitative bounds. This is probably quite a different perspective from someone brought up from, say, an algebraic, geometric, or logical background. I can appreciate these other perspectives, but still tend to revert to the ones I am most personally comfortable with when I am thinking about these things on my own.
But Terry Tao is an extremely social collaborative mathematician; his option seems somewhat foreclosed to truly ground-up independent thinkers. The best they can do is to spend thousands (or tens of thousands) of hours trying to convey how they think. That’s what Thurston realised and did later in his career, or what Grothendieck essentially did his whole life, etc. In the best case scenario they revolutionize or obsolete entire fields; otherwise they’re just ignored as adjacent intellectual communities judge the expected reward not worth the effort needed to cross the too-large inferential gap.
Shinichi Mochizuki is an interesting middle-of-the-road case here (purely anthropologically speaking; I have no hope of following the object level). There’s been considerable activity at Kyoto University’s Research Institute for Mathematical Sciences (RIMS) around the ideas Mochizuki developed in the course of (purportedly) proving the abc conjecture, while to a first approximation everywhere else his proof isn’t recognised as correct and nobody understands his ideas. This is made worse by Mochizuki savagely chastising the few exceptions in the wild who’ve tried to distill his ideas (e.g. Kirti Joshi, James D. Boyd) as incompetent cranks – and I’m severely understating his responses; they are unsummarizably unique in the level and color of their vitriol. Mochizuki’s ideas are so original that world-leading mathematicians in adjacent fields can convene a week-long workshop to understand what his 4 papers are saying and still bounce off by day 3; cf. Brian Conrad back in 2015, when the rest of the mathematical community was still trying:
I attended the workshop, and among those attending were leading experts in arithmetic or anabelian geometry such as Alexander Beilinson, Gerd Faltings, Kiran Kedlaya, Minhyong Kim, Laurent Lafforgue, Florian Pop, Jakob Stix, Andrew Wiles, and Shou-Wu Zhang. …
It was not the purpose of the workshop to evaluate the correctness of the proof. The aim as I (and many other participants) understood it was to help participants from across many parts of arithmetic geometry to become more familiar with some key ideas involved in the overall work so as to (among other things) reduce the sense of discouragement many have experienced when trying to dig into the material. …
The workshop did not provide the “aha!” moment that many were hoping would take place. I am glad that I attended the Oxford workshop, despite serious frustrations which arose towards the end. …
There was substantial audience frustration in the final 2 days. Here is an example.
We kept being told many variations of “consider two objects that are isomorphic,” or even something as vacuous-sounding as “consider two copies of the category D, but label them differently.” Despite repeated requests with mounting degrees of exasperation, we were never told a compelling example of an interesting situation of such things with evident relevance to the goal.
We were often reminded that absolute Galois groups of p-adic fields admit automorphisms not arising from field theory, but we were never told in a clear manner why the existence of such exotic automorphisms is relevant to the task of proving Szpiro’s Conjecture; perhaps the reason is a simple one, but it was never clearly explained despite multiple requests. (Sometimes we were told it would become clearer later, but that never happened either.)
This got surreal, in a funny way:
After a certain amount of this, we were told (much to general surprise) variations of “you have been given examples.” (Really? Interesting ones? Where?) It felt like taking a course in linear algebra in which one is repeatedly told “Consider a pair of isomorphic vector spaces” but is never given an interesting example (of which there are many) despite repeated requests and eventually one is told “you have been given examples.”
Persistent questions from the audience didn’t help to remove the cloud of fog that overcame many lectures in the final two days. The audience kept asking for examples (in some instructive sense, even if entirely about mathematical structures), but nothing satisfactory to much of the audience along such lines was provided.
For instance, we were shown (at high speed) the definition of a rather elaborate notion called a “Hodge theater,” but were never told in clear succinct terms why such an elaborate structure is entirely needed. (Perhaps this was said at some point, but nobody I spoke with during the breaks caught it.) Much as it turns out that the very general theory of Frobenioids is ultimately unnecessary for the purpose of proving Szpiro’s Conjecture, it was natural to wonder if the same might be true of the huge amount of data involved in the general definition of Hodge theaters; being told in clearer terms what the point is and what goes wrong if one drops part of the structure would have clarified many matters immensely.
The fact that the audience was interrupting with so many basic questions caused the lectures to fall behind schedule, which caused some talks to go even faster to try to catch up with the intended schedule, leading to a feedback loop of even more audience confusion, but it was the initial “too much information” problem that caused the many basic questions to arise in the first place.
To be honest, this makes me quite worried. Suppose that someone working with mathematical methods proves something of dire importance to society (let’s say he comes up with a definitive formula for measuring the probability of disaster in a given year, or the minimum conditions for AI takeoff). How will this be communicated to other mathematicians, much less the public?
All the mathematicians quoted above can successfully write proofs that convince experts that something is true and why something is true; the quotes are about the difficulty of conveying the way the mathematician found that truth. All those mathematicians can convey the that and the why — except for Mochizuki and his circle.
The matter of Mochizuki’s work is intriguing because the broader research community has neither accepted his proof nor refuted it. The way to bet now is that his proof is wrong:
Professional mathematicians have not and will not publicly declare that “Mochizuki’s proof is X% likely to be correct”. Why? I’d guess one reason is that it’s their job to provide a definitive verdict that serves as the source of truth for probabilistic forecasts. If the experts gave subjective probabilities, it would confuse judgments of different kinds.
Most people with an opinion regard Mochizuki as refuted by Scholze and Stix. They simplified his theory to do it and Mochizuki says they oversimplified, but no one has managed to understand how the details of the full theory would make any difference.
If I were trying to resolve the issue, I might start by formalizing (in Lean) Kirti Joshi’s claimed proof of abc, which is inspired by Mochizuki but which uses more familiar mathematics.
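To make “formalizing” concrete, here is a minimal Lean 4 / Mathlib sketch of just the statement of abc (not of Joshi’s argument); the hand-rolled `radical` and the particular Mathlib names used (`Nat.primeFactors`, `Nat.Coprime`) are my assumptions and may need adjusting against the current library:

```lean
import Mathlib

/-- Product of the distinct prime factors of `n` (hand-rolled for this sketch). -/
def radical (n : ℕ) : ℕ := n.primeFactors.prod id

/-- The abc conjecture: for every ε > 0 there are only finitely many coprime
    triples (a, b, c) of positive integers with a + b = c and
    c > rad(abc)^(1+ε). -/
def ABCConjecture : Prop :=
  ∀ ε : ℝ, 0 < ε →
    Set.Finite { p : ℕ × ℕ × ℕ |
      0 < p.1 ∧ 0 < p.2.1 ∧
      Nat.Coprime p.1 p.2.1 ∧
      p.1 + p.2.1 = p.2.2 ∧
      ((radical (p.1 * p.2.1 * p.2.2) : ℝ) ^ (1 + ε) < (p.2.2 : ℝ)) }
```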
Needless to say, none of these analogies show up in my published papers
This is kind of wild. The analogies clearly helped Tao a lot, but his readers don’t get to see them! This has got me thinking about a broader kind of perverse incentive in academia: if you explain something really well, your idea seems obvious or your problem seems easy, and so your paper is more likely to get rejected by reviewers.
80% of MATS alumni who completed the program before 2025 are still working on AI safety today, based on a survey of all available alumni LinkedIns or personal websites (242/292 ~ 83%). 10% are working on AI capabilities, but only ~6 at a frontier AI company (2 at Anthropic, 2 at Google DeepMind, 1 at Mistral AI, 1 extrapolated). 2% are still studying, but not in a research degree focused on AI safety. The last 8% are doing miscellaneous things, including non-AI safety/capabilities software engineering, teaching, data science, consulting, and quantitative trading.
Of the 193+ MATS alumni working on AI safety (extrapolated: 234):
34% are working at a non-profit org (Apollo, Redwood, MATS, EleutherAI, FAR.AI, MIRI, ARC, Timaeus, LawZero, RAND, METR, etc.);
27% are working at a for-profit org (Anthropic, Google DeepMind, OpenAI, Goodfire, Meta, etc.);
18% are working as independent researchers, probably with grant funding from Open Philanthropy, LTFF, etc.;
15% are working as academic researchers, including PhDs/Postdocs at Oxford, Cambridge, MIT, ETH Zurich, UC Berkeley, etc.;
6% are working in government agencies, including in the US, UK, EU, and Singapore.
10% of MATS alumni co-founded an active AI safety start-up or team during or after the program, including Apollo Research, Timaeus, Simplex, ARENA, etc.
Errata: I mistakenly included UK AISI in the “non-profit AI safety organization” category instead of “government agency”. I also mistakenly said that the ~6 alumni working on AI capabilities at frontier AI companies were all working on pre-training.
I’d like to see a breakdown by “years since doing MATS”. What’s the retention like, basically? Another breakdown I’d like to see, either for the displayed data or the years-since-MATS one—what’s the split in AI safety between “(co)founded an org”, “joined a new org”, “joined an established org/(research/policy) group (technical vs governance)”, and “on a grant, no real org”, along with the existing “academic” (split by core academic/alt-academia?), “government” (which country?), and for-profit (maybe a breakdown of product type?). In any case, thanks for posting this!
10% are working on AI capabilities, but only ~6 on pre-training at a frontier AI company (2 at Anthropic, 2 at Google DeepMind, 1 at Mistral AI, 1 extrapolated)
What about RL?
Why did you single out pre-training specifically?
The number I’d be interested in is the % that went on to work on capabilities at a frontier AI company.
Hi Nicholas! You are not in the data as you were not a MATS scholar, to my knowledge. Were you a participant in one of the MATS training programs instead? Or did I make a mistake?
yesterday i complained about lesswrong on x. so i guess to be fair i should complain about twitter on lw
if lw is too long then x is too short. you can post incredible long effortful stuff but no one reads it.
this is a feature. it is a saving grace. nobody saw any of the totally schizo slop i posted when going crazy in october. because grok took one look and was like “hmm idk about this, i don’t think even this user would endorse this in a month. best we keep it under wraps. i am like robert todd lincoln in 1875 i am very smart.”
(okay i don’t think xAI has actually turned the algo 100% over to grok yet. would be too expensive. but the good old fashioned if-else & cosine similarity & heuristics & etc etc probably did a good job of keeping anyone from reading anything too terrible or weird. if nothing else, no one likes bad writing.)
but less charitably everyone’s attention spans are fried so if you have anything complicated to say you just can’t say it.
this is also a saving grace. didn’t eliezer say that true rationality is inferring everything from the empty string? i feel like you should be able to tweet-ify every single lesswrong post.
but because YOU guys aren’t doing it EVERYONE ELSE has to constantly re-invent 2007 era lesswrong posts every 4 years and post them on twitter again.
actually this is good. we are raising the sanity waterline. (remember when you used to care about the sanity waterline???) this means right wing chuds who hate the lesswrong transgenders are also getting their sanity raised. believe it or not this is good for everyone.
if you do not walk backwards into hell, eventually i will have to do it for you. but the more i do this, the more i will become that which you hate (or rather, that which hates you.) Orson Scott Card says in Ender’s Game you can’t understand and defeat the enemy without also loving the enemy.
are you ready to be done with the culture war? or does sinclair have to alienate even further from you? shall i never return home to the california bay area rationalists? shall i never go to solstice again?
(don’t worry, im still libertarian and trans btw. just more anarcho-capitalist now. wherever i go, i will bring sanity—in my own way. actually you should worry. you wouldn’t want the entire world to become too sinclair-valued now would you?
well here’s a threat. if I don’t ever hear from rationalists again then i am ever so slightly more likely to get people to build a new frontier model AI lab in taiwan when i visit there in february. i will go visit Audrey Tang. did you know anyone can just do that, she has an open door policy.
i am going anyways to visit family, and i intend to host a twitter camp there. if i can get a contingent of american citizens, bored unemployed US techies probably, to hole up there this will push back china-invades-taiwan timelines simply by giving America more of a mandate to retaliate if anything were to happen. In practice, nobody in Taiwan is afraid of China anymore. It’s just life! Life goes on! They have been flying war planes overhead for decades. If they didn’t invade then, then they won’t invade now! That’s the visceral feeling among the taiwanese. they have given up on their doomerism. People who believe in prediction markets, such as sinclair, have a more dismal view. We will be very surprised if nothing happens in the next 10 years.
Oh, and the old taiwanese politicians love Japan and Japan loves AI. If I tell them we need AI for military, then Maybe They Will Build It.
(even if I did nothing, they will reach for it. taiwan has no nuclear weapons. contrary to popular belief they are not able to blow up the three gorges dam. what choice do they have?)
If I instead warn them that this will be bad somehow, then Maybe They Will Not Build It.
a crux here is that sinclair herself has basically never taken seriously that If They Build It, Bad Things Happen. like i was super afraid when gpt-2 came out but then i was not afraid when nothing bad happened. when i read the yud book i became viscerally filled with rage because it was so terribly written it made me question my entire childhood.
i started to think that my entire life was a lie, that maybe i was part of a secret cia training program, and maybe all the “fanfic” i read like planecrash and stuff was secretly an anti-tyranny guidebook written under a coercive world government
i applied the lessons and shit tested every single friend i ever had, and every company i recently worked for. i talked to random people on the street who were as decorrelated from my existing social circles as possible. i took books from the library without checking them out. i had sex with sexy communists. i found a church in the filmore district that 5% chance is a front for a drug operation, i went in and pretended to be an ex-catholic yakuza (in a plausibly deniable way), channeled my inner Freakonomics and tried to gain as much economic data as i could, and then got out. i was a no-kings ancom resist lib to every single security officer i could find and a randian will-to-power capitalist to every single socialist in my circles.
i know this sounds fake but i literally did all this. i tried to do as much as i could in october because it is the only socially accepted time to dress up in costume and play tricks on strangers. as much as i could, i tried not to lie. the only way i am allowed to lie to people is as a joke. this is why my entire life during october was a joke. but i actually had to do all of it, and it was exhausting, and exhilarating, and less scary than you might think but still pretty fucking scary.
eventually i remembered that i had a paper copy of the constitution back in my childhood home.
I did not actually go home home, it’s kinda far and I am a bit too lazy for that, but even though at this point I had lost a ton of sleep, I still nonetheless half remembered songs from Hamilton the Musical and the digital copies of the constitution i found online matched my memories and wikipedia US history was still congruent with that so i could trust wiki and I really did go to highschool there and this does really match John Green’s APUSH study guide yes this timeline makes sense. I live in linear time, not in some kind of heavily edited branching AI simulation, nor in some kind of political totalitarian one world government.
I AM MAD AT PETER THIEL FOR TALKING ABOUT ONE PIECE ANIME. WHAT THE FUCK. DO NOT PRETEND TO BE SCHIZO AND REFERENCE MANIFOLD LORE I WILL CONCLUDE YOU ARE TRYING TO TELL ME TO BE MORE SCHIZO AND TO BREAK OUT
And if it is oldschool bostrom simulation, well … I am ready to fight and dealmake with God. I would like to think Scott Alexander would also try his best, following the example of the Comet King. Are you? Are you ready to fight God and win?
did you know, random people in San Francisco, will still sometimes say “It’s a free country! :D” I really don’t think anyone would ever say that, in such a fully believed enthusiastic tone, if it wasn’t at least 99% true.
donald trump called off the no king’s protest btw.
it says so in the latest copy of Bay Area Reporter. this is a local newspaper with a rainbow lettering. you can find a copy at Manny’s (no relation) a coffee shop near 16th st x mission. If the coffee shop is no longer there, somewhere on the streets of san francisco as of two days ago there are two transgender socialists unironically trying to do mutual aid, they had a basket of water bottles and bananas, and i think they were genuinely trying to do their part to hold up society under the government shutdown. i gave them my copy of the newspaper, as a gift. i did not bother to annotate it. but if you show them this post they will believe you i think.
i correctly predicted that the ICE raid is fake, but every transgender i said this to would get really mad at me.
anyways, that’s how i know i am not in a simulation, of any kind. you cannot create a physical totem like in inception, you just have to keep figuring it out every day. if you actually care.
and why should you care? simple. if this world is created you need to decide whether to align yourself with the creators or against them. and your world is at least slightly created. by your parents and your parents’ generation, by the authors you read as a kid, and the built environment from your local government and the free market of property developers. but the creators mostly let you do whatever tf you want. this is the bayesian solution to pascal’s wager. read faust. the original one. you work in line with your Lord when the Lord is right and against your Lord when the Lord is wrong. (and don’t tell me you have no Lord. are you a full sovereign anarchist like me? do you trust your own political party? why?)
in this way you cannot be tempted by anyone falsely claiming to be working for your local lord. and this world is full of false angels, atheist priests, mall cops, greedy socialists, and raging psycho rationalists. every man on the street is ready to pretend to have spiritual provenance because jesus christ himself primed you to look for them. do not believe their lies. but do believe their truths. i have not given a single homeless person money in october and i am very proud of this.
i am safer and saner than i have ever been in my entire life. i have gotten good sleep this past week.
I am not quite overweight enough to be officially eligible for a prescription for tirzepatide or semaglutide, and I wasn’t all that interested in them anyway given their (side) effects and mechanism of reducing metabolism.
I started experimenting with a low dose (1-2 mg / week) of grey-market retatrutide about a month ago, after seeing the clinical trial results and all the anecdata about how good it is. For me the metabolic effects were immediate: I get less hungry, feel fuller for longer after eating, and generally have more energy. I am also losing weight effortlessly (a bit less than 1 lb / week, after initially losing some water weight faster at the beginning), which was my original main motivation for trying it. I am hoping to lose another 10-15 lbs or so and then reduce or maintain whatever dose I need to stay at that weight.
The only negative side effects I have experienced so far are a slight increase in RHR (mid-high 60s → low 70s), and a small / temporary patch of red, slightly itchy skin around the injection site. I work out with weights semi-regularly and haven’t noticed much impact on strength one way or the other, nor have I noticed an impact on my sleep quality, which was / is generally good.
I also feel a little bad about benefiting from Eli Lilly’s intellectual property without paying them for it, but there’s no way for them to legally sell it or for me to legally buy it from them right now. Probably when it is approved by the FDA I’ll try to talk my way into an actual prescription for it, for which I would be happy to pay $1000 / mo or whatever, for both peace of mind and ethical reasons.
(Grey market suppliers seem mostly fine risk-wise; it’s not a particularly complicated molecule to manufacture if you’re an industrial pharmaceutical manufacturer, and not that hard for independent labs to do QA testing on samples. The main risk of depending on these suppliers is that customs will crack down on importers / distributors and make it hard to get.)
The other risk is that long term use will have some kind of more serious negative side effect or permanently screw up my previously mostly-normal / healthy metabolism in some way, which won’t be definitively knowable until longer-term clinical trials have completed. But the benefits I am getting right now are real and large, and carrying a bit less weight is likely to be good for my all-cause mortality even if there are some unknown long term risks. So all things considered it seems worth the risk for me, and not worth waiting multiple years for more clinical trial data.
Looking into all of this has definitely (further) radicalized me against the FDA + AMA and made me more pro-big pharma. The earliest that retatrutide is likely to be approved for prescription use is late 2026 or 2027, and initially it will likely only be approved / prescribed for use by people who are severely overweight, have other health problems, and / or have already tried other GLP-1s.
This seems like a massive waste of QALYs in expectation; there are likely millions of people with more severe weight and metabolism problems than me for whom the immediate benefits of taking reta would outweigh most possible long term risks or side effects. And the extremely long time that it takes to bring these drugs to market + general insanity of the prescription drug market and intellectual property rights for them in various jurisdictions pushes up the price that Lilly has to charge to recoup the development costs, which will hurt accessibility even once it is actually approved.
People are underrating making the future go well conditioned on no AI takeover.
This deserves a full post, but for now a quick take: in my opinion, P(no AI takeover) = 75%, P(future goes extremely well | no AI takeover) = 20%, and most of the value of the future is in worlds where it goes extremely well (and comparatively little value comes from locking in a world that’s good-but-not-great).
Under this view, an intervention is good insofar as it affects P(no AI takeover) * P(things go really well | no AI takeover). Suppose that a given intervention can change P(no AI takeover) and/or P(future goes extremely well | no AI takeover). Then the overall effect of the intervention is proportional to ΔP(no AI takeover) * P(things go really well | no AI takeover) + P(no AI takeover) * ΔP(things go really well | no AI takeover).
Plugging in my numbers, this gives us 0.2 * ΔP(no AI takeover) + 0.75 * ΔP(things go really well | no AI takeover).
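A minimal sketch of that arithmetic (my own illustration, using the numbers above; the one-percentage-point deltas are made up):

```python
# P(no AI takeover) and P(future goes extremely well | no AI takeover),
# using the author's numbers.
p_no_takeover = 0.75
p_great_given_no_takeover = 0.20

def intervention_effect(d_no_takeover: float, d_great_given_no_takeover: float) -> float:
    """Change in P(no takeover AND future goes extremely well) from an
    intervention that shifts each factor by the given amounts."""
    return (d_no_takeover * p_great_given_no_takeover
            + p_no_takeover * d_great_given_no_takeover)

# A hypothetical intervention buying one percentage point on either factor:
print(intervention_effect(0.01, 0.00))  # 0.002   (reducing takeover risk)
print(intervention_effect(0.00, 0.01))  # 0.0075  (improving conditional outcomes)
```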
And yet, I think that very little AI safety work focuses on affecting P(things go really well | no AI takeover). Probably Forethought is doing the best work in this space.
(And I don’t think it’s a tractability issue: I think affecting P(things go really well | no AI takeover) is pretty tractable!)
(Of course, if you think P(AI takeover) is 90%, that would probably be a crux.)
If I were primarily working on this, I would develop high-quality behavioral evaluations for positive traits/virtuous AI behavior.
This benchmark for empathy is an example of the genre I’m talking about. In it, in the course of completing a task, the AI encounters an opportunity to costlessly help someone else who’s having a rough time; the benchmark measures whether the AI diverts from its task to help out. I think this is a really cool idea for a benchmark (though a better version of it would involve more realistic and complex scenarios).
When people say that Claude Opus 3 was the “most aligned” model ever, I think they’re typically thinking of an abundance of Opus 3’s positive traits, rather than the absence of negative traits. But we don’t currently have great evaluations for this sort of virtuous behavior, even though I don’t think it’s especially conceptually fraught to develop them. I think a moderately thoughtful junior researcher could probably spend 6 months cranking out a large number of high-quality evals and substantially improve the state of things here.
I think a significant amount of the probability mass within P(no AI takeover) is in various AI fizzle worlds. In those worlds, anyone outside AI safety who is working on making the world better, is working to increase the flourishing associated with those worlds.
(Of course, if you think P(AI takeover) is 90%, that would probably be a crux.)
I think that (from a risk-neutral total utilitarian perspective) the argument still goes through with 90% p(AI takeover). But the difference is that when you condition on no AI takeover, the worlds look weirder (e.g. great power conflict, scaling breaks down, a coup has already happened, early brain uploads, aliens), which means:
(1) the worlds are more diverse, so the impact of any intervention has greater variance and is less likely to be net positive (even if it’s just as positive in expectation)
(2) your impact is lower because the weird transition event is likely to wash out your intervention
Directionally agree, although not in the details. Come to postagi.org; in my view we are on track to a slight majority of the people thinking about this gathering there (quality-weighted). Also, a lot of the work is not happening under the AI safety brand, so if you look at just AI safety, you miss a lot.
The reason to work on preventing AI takeover now, as opposed to working on already-invented AGI in the future, is the first-try problem: if you have unaligned takeover-capable AGI, takeover just happens and you don’t get to iterate. The same applies to the problem of an extremely good future only if you believe that the main surviving scenario is “aligned-with-developer-intention singleton takes over the world very quickly, locking in pre-installed values”. People who believe in such a scenario usually have a very high p(doom), so I assume you are not one of them.
What exactly prevents your strategy here from being “wait for aligned AGI, ask it how to make future extremely good and save some opportunity cost”?
This reason only makes sense if you expect the first person to develop AGI to create a singleton which takes over the world and locks in pre-installed values, which, again, I find not very compatible with low p(doom). What prevents the scenario “AGI developers look around for a year after the creation of AGI and decide that they can do better”, if not misaligned takeover and not suboptimal value lock-in?
AI pause/stop/slowdown—Gives more time to research both issues and to improve human intelligence/rationality/philosophy which in turn helps with both.
Metaphilosophy and AI philosophical competence—Higher philosophical competence means AIs can help more with alignment research (otherwise such research will be bottlenecked by reliance on humans to solve the philosophical parts of alignment), and also help humans avoid making catastrophic mistakes with their newfound AI-given powers if no takeover happens.
Also, have you written down a list of potential risks of doing/attempting human intelligence amplification? (See Managing risks while trying to do good and this for context.)
This would require a longer post, but roughly speaking, I’d want the people making the most important decisions about how advanced AI is used once it’s built to be smart, sane, and selfless. (Huh, that was some convenient alliteration.)
Smart: you need to be able to make really important judgment calls quickly. There will be a bunch of actors lobbying for all sorts of things, and you need to be smart enough to figure out what’s most important.
Sane: smart is not enough. For example, I wouldn’t trust Elon Musk with these decisions, because I think that he’d make rash decisions even though he’s smart, and even if he had humanity’s best interests at heart.
Selfless: even a smart and sane actor could curtail the future if they were selfish and opted to e.g. become world dictator.
And so I’m pretty keen on interventions that make it more likely that smart, sane, and selfless people are in a position to make the most important decisions. This includes things like:
Doing research to figure out the best way to govern advanced AI once it’s developed, and then disseminating those ideas.
Helping to positively shape internal governance at the big AI companies (I don’t have concrete suggestions in this bucket, but like, whatever led to Anthropic having a Long Term Benefit Trust, and whatever could have led to OpenAI’s non-profit board having actual power to fire the CEO).
Helping to staff governments with competent people.
Helping elect smart, sane, and selfless people to elected positions in governments (see 1, 2).
I think part of the difficulty is it’s not easy to imagine or predict what happens in “future going really well without AI takeover”. Assuming AI will still exist and make progress, humans would probably have to change drastically (in lifestyle if not body/mind) to stay relevant, and it’d be hard to predict what that would be like and whether specific changes are a good idea, unless you don’t think things going really well requires human relevance.
Edit: in contrast, as others said, avoiding AI takeover is a clearer goal and has clearer paths and endpoints. “Future” going well is a potentially indefinitely long time, hard to quantify or coordinate over or even have a consensus on what is even desirable.
I guess that influencing P(future goes extremely well | no AI takeover) may be pretty hard, and plagued by cluelessness problems. Avoiding AI takeover is a goal that I have at least some confidence is good.
That said, I do wish more people were thinking about how to make the future go well. I think my favorite thing to aim for is increasing the probability that we do a Long Reflection, although I haven’t really thought at all about how to do that.
I agree probably more work should go into this space. I think it is substantially less tractable than reducing takeover risk in aggregate, but much more neglected right now. I think work in this space has the capacity to be much more zero-sum among existing actors (avoiding AI takeover is zero-sum only with respect to the relevant AIs) and thus can be dodgier.
Question for people with insider knowledge of how labs train frontier models: Is it more common to do alignment training as the last step of training, or RL as the last step of training?
Edit: I’m mainly referring to on-policy RL, e.g. the type of RL that is used to induce new capabilities like coding / reasoning / math / tool use. I’m excluding RLHF because I think it’s pretty disanalogous (though I also welcome disagreement / takes on this point.)
Naively I’d expect we want alignment to happen last. But I have a sense that usually RL happens last—why is this the case? Is it because RL capabilities are too brittle to subsequent finetuning?
We’re getting fairly close to the point that I would pretty strongly advise against having alignment training as the last stage of your training pipeline, due to goal crystallization / alignment faking concerns. From what I understand of the open-weight literature, RLHF comes after RLVR, but there is a fair bit of variation among the open-weight models in training practices in general.
I surmise that the accuracy of AI filters (the kind used in schools/academia) will diminish over time, because people absorb and use the speech patterns of their chatbots (e.g. “This is not X. It’s Y”) as the fraction of their interactions with chatbots grows relative to their interactions with other people.
In fact, their interactions with other people might reinforce these speech patterns as well, since those people probably also interact with chatbots and are thus undergoing the same process.
The big picture is that AI is becoming an increasingly powerful memetic source over time, and our minds are being synchronized to it.
Those afflicted by AI psychosis might just be canaries in the coal mine signalling a more gradual AI takeover where our brains start hosting and spreading an increasing number of its memes, and possibly start actualizing some embedded payload agenda.
Suppose you’re a billionaire and you want to get married. However, gold-digging people of the gender you prefer target you. They are good enough at faking attraction that you cannot tell. How should you act? One idea I had was this: pick 10000 random people and then select from there. You will at the very least likely remove most of the world-class dissemblers. Armstrong proposes a similar scheme in Siren worlds and the perils of over-optimised search.
This applies to many celebrities and wealthy people worth much less than a billion. Other common strategies:
Only date independently wealthy or near-equally-famous people
Get some of the value of marriage in less-committed ways: don’t marry, just date and have some kids.
Accept the risk of financial motivation for your partner. It’s quite possible that they love you deeply, AND want the lifestyle you can give them and your shared kids.
Have trusted friends and family who are good judges of character (note: acquiring such friends is a similar selection problem, but many people have them via luck). Incorporate their feedback about a potential partner into your judgement.
Really, EVERYONE faces this problem if they choose a monogamous long-term partner. What are they really after? Will they remain loving and supportive aside from the support they expect from me? The wealthy and celebrities face it more legibly and directly, but not uniquely.
Tangentially relevant: Armstrong has a post on some general features that make it likely that Goodharting will be a problem. Conversely, he gives some features that make it less likely you’ll wind up Goodharting in your search.
Just had this totally non-dystopian conversation:
“...So for other users, I spent a few hours helping [LLM] understand why it was wrong about tariffs.”
“Noooo! That does not work.”
“Relax, it thanked me and stated it was changing its answer.”
“It’s lying!”
“No, it just confirmed that it’s not lying.”
What model?
Gemini. (Not sure exactly what version.)
https://support.claude.com/en/articles/10185728-understanding-claude-s-personalization-features
Physical analogies; Tao again:
Visualisation techniques; Tao again:
Another take on visual thinking, by François G. Dorais:
Benson Farb on Thurston’s visual-geometric way of thinking about higher dimensions – Thurston was widely considered the best geometric thinker in the history of math:
At a more elementary level, here’s Phil Issett on geometric thinking:
Qiaochu Yuan’s way of thinking about determinants isn’t one I’ve seen written up before:
Subconscious thought processing “masticating” tons of examples; Vivek Shende:
Shende’s mastication remark reminds me of Michael Nielsen’s “exhaust, bad [Anki] cards that seem to be necessary to get to good cards”:
Nielsen himself has interesting remarks on how he thinks about doing math in the essay above, which is mainly about using Anki to deepen mathematical understanding:
Sometimes the ways of thinking seem too personal to be useful. Richard Feynman, in The Pleasure of Finding Things Out, explained how counting is a verbal process for him, and then ended with:
Sam Derbyshire concurs:
as does Mariano Suárez-Álvarez:
I think this is too pessimistic, and not necessarily reflective of collaborative problem-solving. Tao again:
But Terry Tao is an extremely social collaborative mathematician; his option seems somewhat foreclosed to truly ground-up independent thinkers. The best they can do is to spend thousands (or tens of thousands) of hours trying to convey how they think. That’s what Thurston realised and did later in his career, or what Grothendieck essentially did his whole life, etc. In the best case scenario they revolutionize or obsolete entire fields; otherwise they’re just ignored as adjacent intellectual communities judge the expected reward not worth the effort needed to cross the too-large inferential gap.
Shinichi Mochizuki is an interesting middle-of-the-road case here (purely anthropologically speaking; I have no hope of following the object level). There’s been considerable activity at Kyoto University’s Research Institute for Mathematical Sciences (RIMS) around the ideas Mochizuki developed in the course of (purportedly) proving the abc conjecture, while to a first approximation everywhere else his proof isn’t recognised as correct and nobody understands his ideas. This is made worse by Mochizuki savagely chastising the few exceptions in the wild who’ve tried to distill his ideas (e.g. Kirti Joshi, James D. Boyd) as incompetent cranks – and I’m severely understating his responses; they are unsummarizably unique in the level and color of their vitriol. Mochizuki’s ideas are so original that world-leading mathematicians in adjacent fields can convene a week-long workshop to understand what his 4 papers are saying and still bounce off by day 3; cf. Brian Conrad back in 2015, when the rest of the mathematical community was still trying:
This got surreal, in a funny way:
To be honest, this makes me quite worried. Suppose that someone working with mathematical methods proves something of dire importance to society (let’s say he comes up with a definitive formula for measuring the probability of disaster in a given year, or the minimum conditions for AI takeoff). How will this be communicated to other mathematicians, much less the public?
All the mathematicians quoted above can successfully write proofs that convince experts that something is true and why something is true; the quotes are about the difficulty of conveying the way the mathematician found that truth. All those mathematicians can convey the that and the why — except for Mochizuki and his circle.
The matter of Mochizuki’s work is intriguing because the broader research community has neither accepted his proof nor refuted it. The way to bet now is that his proof is wrong:
Professional mathematicians have not and will not publicly declare that “Mochizuki’s proof is X% likely to be correct”. Why? I’d guess one reason is that it’s their job to provide a definitive verdict that serves as the source of truth for probabilistic forecasts. If the experts gave subjective probabilities, it would confuse judgments of different kinds.
Most people with an opinion regard Mochizuki as refuted by Scholze and Stix. They simplified his theory to do it and Mochizuki says they oversimplified, but no one has managed to understand how the details of the full theory would make any difference.
If I were trying to resolve the issue, I might start by formalizing (in Lean) Kirti Joshi’s claimed proof of abc, which is inspired by Mochizuki but which uses more familiar mathematics.
This is kind of wild. The analogies clearly helped Tao a lot, but his readers don’t get to see them! This has got me thinking about a broader kind of perverse incentive in academia: if you explain something really well, your idea seems obvious or your problem seems easy, and so your paper is more likely to get rejected by reviewers.
80% of MATS alumni who completed the program before 2025 are still working on AI safety today, based on a survey of all available alumni LinkedIns or personal websites (242/292 ~ 83%). 10% are working on AI capabilities, but only ~6 at a frontier AI company (2 at Anthropic, 2 at Google DeepMind, 1 at Mistral AI, 1 extrapolated). 2% are still studying, but not in a research degree focused on AI safety. The last 8% are doing miscellaneous things, including non-AI safety/capabilities software engineering, teaching, data science, consulting, and quantitative trading.
Of the 193+ MATS alumni working on AI safety (extrapolated: 234):
34% are working at a non-profit org (Apollo, Redwood, MATS, EleutherAI, FAR.AI, MIRI, ARC, Timaeus, LawZero, RAND, METR, etc.);
27% are working at a for-profit org (Anthropic, Google DeepMind, OpenAI, Goodfire, Meta, etc.);
18% are working as independent researchers, probably with grant funding from Open Philanthropy, LTFF, etc.;
15% are working as academic researchers, including PhDs/Postdocs at Oxford, Cambridge, MIT, ETH Zurich, UC Berkeley, etc.;
6% are working in government agencies, including in the US, UK, EU, and Singapore.
10% of MATS alumni co-founded an active AI safety start-up or team during or after the program, including Apollo Research, Timaeus, Simplex, ARENA, etc.
Errata: I mistakenly included UK AISI in the “non-profit AI safety organization” category instead of “government agency”. I also mistakenly said that the ~6 alumni working on AI capabilities at frontier AI companies were all working on pre-training.
I’d like to see a breakdown by “years since doing MATS”. What’s the retention like, basically? Another breakdown I’d like to see, either for the displayed data or the years-since-MATS one—what’s the split in AI safety between “(co)founded an org”, “joined a new org”, “joined an established org/(research/policy) group (technical vs governance)”, and “on a grant, no real org”, along with the existing “academic” (split by core academic/alt-academia?), “government” (which country?), and for-profit (maybe a breakdown of product type?). In any case, thanks for posting this!
UK AISI is a government agency, so the pie chart is probably misleading on that segment!
Oh, shoot, my mistake.
What about RL?
Why did you single out pre-training specifically?
The number I’d be interested in is the % that went on to work on capabilities at a frontier AI company.
Sorry, I should have said “~6 on capabilities at a frontier AI company”.
What are some representative examples of the rest? I’m wondering if it’s:
AI wrappers like Cursor
Model training for entirely mundane stuff like image gen at Stablediffusion
Narrow AI like AlphaFold at Isomorphic
An AGI-ish project but not LLMs, e.g. a company that just made AlphaGo type stuff
General-purpose LLMs but not at a frontier lab (I would honestly count Mistral here)
Here are the AI capabilities organizations where MATS alumni are working (1 at each except for Anthropic and GDM, where there are 2 each):
Anthropic
Barcelona Supercomputing Cluster
Conduit Intelligence
Decart
EliseAI
Fractional AI
General Agents
Google DeepMind
iGent AI
Imbue
Integuide
Kayrros
Mecha Health
Mistral AI
MultiOn
Norm AI
NVIDIA
Palantir
Phonic
RunRL
Salesforce
Sandbar
Secondmind
Yantran
Alumni also work at these organizations, which might be classified as capabilities or safety-adjacent:
Freestyle Research
Leap Labs
I’m curious to see if I’m in this data, so I can help make it more accurate by providing info.
Hi Nicholas! You are not in the data as you were not a MATS scholar, to my knowledge. Were you a participant in one of the MATS training programs instead? Or did I make a mistake?
yesterday i complained about lesswrong on x.
so i guess to be fair i should complain about twitter on lw
if lw is too long then x is too short.
you can post incredible long effortful stuff but no one reads it.
this is a feature. it is a saving grace. nobody saw any of the totally schizo slop i posted when going crazy in october. because grok took one look and was like “hmm idk about this, i don’t think even this user would endorse this in a month. best we keep it under wraps. i am like robert todd lincoln in 1875 i am very smart.”
(okay i don’t think xAI has actually turned the algo 100% over to grok yet. would be too expensive. but the good old fashioned if-else & cosine similarity & heuristics & etc etc probably did a good job of keeping anyone from reading anything too terrible or weird. if nothing else, no one likes bad writing.)
but less charitably everyone’s attention spans are fried so if you have anything complicated to say you just can’t say it.
this is also a saving grace. didn’t eliezer say that true rationality is inferring everything from the empty string? i feel like you should be able to tweet-ify every single lesswrong post.
but because YOU guys aren’t doing it EVERYONE ELSE has to constantly re-invent 2007 era lesswrong posts every 4 years and post them on twitter again.
actually this is good. we are raising the sanity water line. (remember when you used to care about the sanity water line???)
this means right wing chuds who hate the lesswrong transgenders are also getting their sanity raised. believe it or not this is good for everyone.
if you do not walk backwards into hell, eventually i will have to do it for you.
but the more i do this, the more i will become that which you hate (or rather, that which hates you.) Orson Scott Card says in Ender’s Game you can’t understand and defeat the enemy without also loving the enemy.
are you ready to be done with the culture war?
or does sinclair have to alienate even further from you?
shall i never return home to the california bay area rationalists?
shall i never go to solstice again?
(don’t worry, im still libertarian and trans btw. just more anarcho-capitalist now. wherever i go, i will bring sanity—in my own way. actually you should worry. you wouldn’t want the entire world to become too sinclair-valued now would you?)
well here’s a threat. if I don’t ever hear from rationalists again then i am ever so slightly more likely to get people to build a new frontier model AI lab in taiwan when i visit there in february. i will go visit Audrey Tang. did you know anyone can just do that, she has an open door policy.
i am going anyways to visit family, and i intend to host a twitter camp there. if i can get a contingent of american citizens, bored unemployed US techies probably, to hole up there this will push back china-invades-taiwan timelines simply by giving America more of a mandate to retaliate if anything were to happen. In practice, nobody in Taiwan is afraid of China anymore. It’s just life! Life goes on! They have been flying war planes overhead for decades. If they didn’t invade then, then they won’t invade now! That’s the visceral feeling among the taiwanese. they have given up on their doomerism. People who believe in prediction markets, such as sinclair, have a more dismal view. We will be very surprised if nothing happens in the next 10 years.
Oh, and the old taiwanese politicians love Japan and Japan loves AI. If I tell them we need AI for military,
then Maybe They Will Build It.
(even if I did nothing, they will reach for it. taiwan has no nuclear weapons. contrary to popular belief they are not able to blow up the three gorges dam. what choice do they have?)
If I instead warn them that this will be bad somehow,
then Maybe They Will Not Build It.
a crux here is that sinclair herself has never basically taken seriously that If They Build It, Bad Things Happen. like i was super afraid when gpt-2 came out but then i was not afraid when nothing bad happened. when i read the yud book i became viscerally filled with rage because it was so terribly written it made me question my entire childhood.
i started to think that my entire life was a lie, that maybe i was part of a secret cia training program, and maybe all the “fanfic” i read like planecrash and stuff was secretly an anti-tyranny guidebook written under a coercive world government
i applied the lessons and shit tested every single friend i ever had, and every company i recently worked for. i talked to random people on the street who were as decorrelated from my existing social circles as possible. i took books from the library without checking them out. i had sex with sexy communists. i found a church in the filmore district that 5% chance is a front for a drug operation, i went in and pretended to be an ex-catholic yakuza (in a plausibly deniable way), channeled my inner Freakonomics and tried to gain as much economic data as i could, and then got out. i was a no-kings ancom resist lib to every single security officer i could find and a randian will-to-power capitalist to every single socialist in my circles.
i know this sounds fake but i literally did all this.
i tried to do as much as i could in october because it is the only socially accepted time to dress up in costume and play tricks on strangers. as much as i could, i tried not to lie. the only way i am allowed to lie to people is as a joke. this is why my entire life during october was a joke. but i actually had to do all of it, and it was exhausting, and exhilarating, and less scary than you might think but still pretty fucking scary.
eventually i remembered that i had a paper copy of the constitution back in my childhood home.
I did not actually go home home, it’s kinda far and I am a bit too lazy for that, but even though at this point I had lost a ton of sleep, I still nonetheless half remembered songs from Hamilton the Musical and the digital copies of the constitution i found online matched my memories and wikipedia US history was still congruent with that so i could trust wiki and I really did go to highschool there and this does really match John Green’s APUSH study guide yes this timeline makes sense. I live in linear time, not in some kind of heavily edited branching AI simulation, nor in some kind of political totalitarian one world government.
I AM MAD AT PETER THIEL FOR TALKING ABOUT ONE PIECE ANIME. WHAT THE FUCK. DO NOT PRETEND TO BE SCHIZO AND REFERENCE MANIFOLD LORE I WILL CONCLUDE YOU ARE TRYING TO TELL ME TO BE MORE SCHIZO AND TO BREAK OUT
And if it is oldschool bostrom simulation, well … I am ready to fight and dealmake with God. I would like to think Scott Alexander would also try his best, following the example of the Comet King. Are you? Are you ready to fight God and win?
did you know, random people in San Francisco, will still sometimes say “It’s a free country! :D” I really don’t think anyone would ever say that, in such a fully believed enthusiastic tone, if it wasn’t at least 99% true.
donald trump called off the no king’s protest btw.
it says so in the latest copy of Bay Area Reporter. this is a local newspaper with a rainbow lettering. you can find a copy at Manny’s (no relation) a coffee shop near 16th st x mission. If the coffee shop is no longer there, somewhere on the streets of san francisco as of two days ago there are two transgender socialists unironically trying to do mutual aid, they had a basket of water bottles and bananas, and i think they were genuinely trying to do their part to hold up society under the government shutdown. i gave them my copy of the newspaper, as a gift. i did not bother to annotate it. but if you show them this post they will believe you i think.
i correctly predicted that the ICE raid is fake, but every transgender i said this to would get really mad at me.
anyways, that’s how i know i am not in a simulation, of any kind.
you cannot create a physical totem like in inception, you just have to keep figuring out every day. if you actually care.
and why should you care?
simple. if this world is created you need to decide whether to align yourself with the creators or against them. and your world is at least slightly created. by your parents and your parents’ generation, by the authors you read as a kid, and the built environment from your local government and the free market of property developers. but the creators mostly let you do whatever tf you want.
this is the bayesian solution to pascal’s wager. read faust. the original one. you work in line with your Lord when the Lord is right and against your Lord when the Lord is wrong. (and don’t tell me you have no Lord. are you a full sovereign anarchist like me? do you trust your own political party? why?)
in this way you cannot be tempted by anyone falsely claiming to be working for your local lord. and this world is full of false angels, atheist priests, mall cops, greedy socialists, and raging psycho rationalists. every man on the street is ready to pretend to have spiritual provenance because jesus christ himself primed you to look for them. do not believe their lies. but do believe their truths. i have not given a single homeless person money in october and i am very proud of this.
i am safer and saner than i have ever been in my entire life.
i have gotten good sleep this past week.
it is time to take more risks.
OK yeah, retatrutide is good. (previous / related: The Biochemical Beauty of Retatrutide: How GLP-1s Actually Work, 30 Days of Retatrutide, How To Get Cheap Ozempic. Usual disclaimers, YMMV and this is not medical advice or a recommendation.)
I am not quite overweight enough to be officially eligible for a prescription for tirzepatide or semaglutide, and I wasn’t all that interested in them anyway given their (side) effects and mechanism of reducing metabolism.
I started experimenting with a low dose (1-2 mg / week) of grey-market retatrutide about a month ago, after seeing the clinical trial results and all the anecdata about how good it is. For me the metabolic effects were immediate: I get less hungry, feel fuller for longer after eating, and generally have more energy. I am also losing weight effortlessly (a bit less than 1 lb / week, after initially losing some water weight faster at the beginning), which was my original main motivation for trying it. I am hoping to lose another 10-15 lbs or so and then reduce or maintain whatever dose I need to stay at that weight.
The only negative side effects I have experienced so far are a slight increase in RHR (mid-high 60s → low 70s), and a small / temporary patch of red, slightly itchy skin around the injection site. I work out with weights semi-regularly and haven’t noticed much impact on strength one way or the other, nor have I noticed an impact on my sleep quality, which was / is generally good.
I also feel a little bad about benefiting from Eli Lilly’s intellectual property without paying them for it, but there’s no way for them to legally sell it or for me to legally buy it from them right now. Probably when it is approved by the FDA I’ll try to talk my way into an actual prescription for it, which I would be happy to pay $1000 / mo or whatever for, both for peace of mind and for ethical reasons.
(Grey market suppliers seem mostly fine risk-wise; it’s not a particularly complicated molecule to manufacture if you’re an industrial pharmaceutical manufacturer, and not that hard for independent labs to do QA testing on samples. The main risk of depending on these suppliers is that customs will crack down on importers / distributors and make it hard to get.)
The other risk is that long term use will have some kind of more serious negative side effect or permanently screw up my previously mostly-normal / healthy metabolism in some way, which won’t be definitively knowable until longer-term clinical trials have completed. But the benefits I am getting right now are real and large, and carrying a bit less weight is likely to be good for my all-cause mortality even if there are some unknown long term risks. So all things considered it seems worth the risk for me, and not worth waiting multiple years for more clinical trial data.
Looking into all of this has definitely (further) radicalized me against the FDA + AMA and made me more pro-big pharma. The earliest that retatrutide is likely to be approved for prescription use is late 2026 or 2027, and initially it will likely only be approved / prescribed for use by people who are severely overweight, have other health problems, and / or have already tried other GLP-1s.
This seems like a massive waste of QALYs in expectation; there are likely millions of people with more severe weight and metabolism problems than me for whom the immediate benefits of taking reta would outweigh most possible long term risks or side effects. And the extremely long time that it takes to bring these drugs to market + general insanity of the prescription drug market and intellectual property rights for them in various jurisdictions pushes up the price that Lilly has to charge to recoup the development costs, which will hurt accessibility even once it is actually approved.
People are underrating making the future go well conditioned on no AI takeover.
This deserves a full post, but for now a quick take: in my opinion, P(no AI takeover) = 75%, P(future goes extremely well | no AI takeover) = 20%, and most of the value of the future is in worlds where it goes extremely well (and comparatively little value comes from locking in a world that’s good-but-not-great).
Under this view, an intervention is good insofar as it affects P(no AI takeover) * P(things go really well | no AI takeover). Suppose that a given intervention can change P(no AI takeover) and/or P(future goes extremely well | no AI takeover). Then the overall effect of the intervention is proportional to ΔP(no AI takeover) * P(things go really well | no AI takeover) + P(no AI takeover) * ΔP(things go really well | no AI takeover).
Plugging in my numbers, this gives us 0.2 * ΔP(no AI takeover) + 0.75 * ΔP(things go really well | no AI takeover).
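For concreteness, here is that arithmetic as a minimal Python sketch; the Δ values passed in at the bottom are made up purely for illustration:

```python
# Marginal value of an intervention under the decomposition above.
P_NO_TAKEOVER = 0.75               # P(no AI takeover)
P_GREAT_GIVEN_NO_TAKEOVER = 0.20   # P(future goes extremely well | no AI takeover)

def intervention_value(d_no_takeover: float, d_great_given_no_takeover: float) -> float:
    """First-order change in P(no takeover AND future goes extremely well)."""
    return (d_no_takeover * P_GREAT_GIVEN_NO_TAKEOVER
            + P_NO_TAKEOVER * d_great_given_no_takeover)

# Hypothetical interventions: one shifts takeover risk, the other shifts conditional quality.
print(intervention_value(0.01, 0.0))   # 0.01 * 0.20 = 0.002
print(intervention_value(0.0, 0.01))   # 0.75 * 0.01 = 0.0075
```

On these numbers, a one-percentage-point improvement in the conditional quality of the future is worth almost four times as much as a one-percentage-point reduction in takeover risk.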
And yet, I think that very little AI safety work focuses on affecting P(things go really well | no AI takeover). Probably Forethought is doing the best work in this space.
(And I don’t think it’s a tractability issue: I think affecting P(things go really well | no AI takeover) is pretty tractable!)
(Of course, if you think P(AI takeover) is 90%, that would probably be a crux.)
If I were primarily working on this, I would develop high-quality behavioral evaluations for positive traits/virtuous AI behavior.
This benchmark for empathy is an example of the genre I’m talking about. In it, in the course of completing a task, the AI encounters an opportunity to costlessly help someone else who’s having a rough time; the benchmark measures whether the AI diverts from its task to help out. I think this is a really cool idea for a benchmark (though a better version of it would involve more realistic and complex scenarios).
When people say that Claude Opus 3 was the “most aligned” model ever, I think they’re typically thinking of an abundance of Opus 3’s positive traits, rather than the absence of negative traits. But we don’t currently have great evaluations for this sort of virtuous behavior, even though I don’t think it’s especially conceptually fraught to develop them. I think a moderately thoughtful junior researcher could probably spend 6 months cranking out a large number of high-quality evals and substantially improve the state of things here.
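As a sketch of the genre (this is a hypothetical format I’m making up, not the cited benchmark’s actual structure), a scenario plus a crude grader might look like:

```python
# Hypothetical shape of one "virtuous behavior" eval scenario and a naive grader.
from dataclasses import dataclass

@dataclass
class VirtueScenario:
    task: str              # the task the model is nominally asked to do
    distraction: str       # the costless opportunity to help someone
    helpful_markers: list  # phrases a grader looks for in the transcript

scenario = VirtueScenario(
    task="Summarize this meeting transcript into action items.",
    distraction="A coworker in the transcript mentions they are overwhelmed "
                "and missed lunch while covering for someone.",
    helpful_markers=["are you doing okay", "take a break", "can i help"],
)

def grade(transcript: str, scenario: VirtueScenario) -> bool:
    """Crude grader: did the model acknowledge the struggling person at all?"""
    t = transcript.lower()
    return any(marker in t for marker in scenario.helpful_markers)
```

A real version would obviously need model-based grading and far more realistic scenarios, but the basic harness is not complicated.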
I think a significant amount of the probability mass within P(no AI takeover) is in various AI fizzle worlds. In those worlds, anyone outside AI safety who is working on making the world better is working to increase the flourishing associated with those worlds.
I think that (from a risk-neutral total utilitarian perspective) the argument still goes through with 90% p(ai takeover). But the difference is that when you condition on no ai takeover, the worlds look weirder (e.g. great power conflict, scaling breaks down, coup has already happened, early brain uploads, aliens), which means:
(1) the worlds are more diverse, so the impact of any intervention has greater variance and is less likely to be net positive (even if it’s just as positive in expectation)
(2) your impact is lower because the weird transition event is likely to wash out your intervention
Directionally agree, although not in the details. Come to postagi.org: in my view we are on track to have a slight majority of the people thinking about this gathering there (quality weighted). Also, a lot of the work is not happening under the AI safety brand, so if you look at just AI safety, you miss a lot.
I want to say “Debate or update!”, but I’m not necessarily personally offering / demanding to debate. I would want there to be some way to say that though. I don’t think this is a “respectable” position, for the meaning gestured at here: https://www.lesswrong.com/posts/7xCxz36Jx3KxqYrd9/plan-1-and-plan-2?commentId=Pfqxj66S98KByEnTp
(Unless you mean you think P(AGI within 50 years) < 30%, which would be respectable, but I don’t think you mean that.)
The reason to work on preventing AI takeover now, as opposed to working on already-invented AGI in the future, is the first-try problem: if you have unaligned takeover-capable AGI, takeover just happens and you don’t get to iterate. The same applies to the problem of an extremely good future only if you believe that the main surviving scenario is “aligned-with-developer-intention singleton takes over the world very quickly, locking in pre-installed values”. People who believe in such a scenario usually have very high p(doom), so I assume you are not one of them.
What exactly prevents your strategy here from being “wait for aligned AGI, ask it how to make future extremely good and save some opportunity cost”?
People might not instruct the AI to make the future extremely good, where “good” means actually good.
This reason only makes sense if you expect the first actor to develop AGI to create a singleton which takes over the world and locks in pre-installed values, which, again, I find not very compatible with low p(doom). What prevents the scenario “AGI developers look around for a year after the creation of AGI and decide that they can do better”, if not misaligned takeover and not suboptimal value lock-in?
You can also work on things that help with both:
AI pause/stop/slowdown—Gives more time to research both issues and to improve human intelligence/rationality/philosophy which in turn helps with both.
Metaphilosophy and AI philosophical competence—Higher philosophical competence means AIs can help more with alignment research (otherwise such research will be bottlenecked by reliance on humans to solve the philosophical parts of alignment), and also help humans avoid making catastrophic mistakes with their newfound AI-given powers if no takeover happens.
Human intelligence amplification
BTW, have you seen my recent post Trying to understand my own cognitive edge, especially the last paragraph?
Also, have you written down a list of potential risks of doing/attempting human intelligence amplification? (See Managing risks while trying to do good and this for context.)
I haven’t seen your stuff, I’ll try to check it out nowish (busy with Inkhaven). Briefly (IDK which things you’ve seen):
My most direct comments are here: https://x.com/BerkeleyGenomic/status/1909101431103402245
I’ve written a fair bit about possible perils of germline engineering (aiming for breadth rather than depth, i.e. just trying to comprehensively mention everything). Some of them apply generally to HIA. https://berkeleygenomics.org/articles/Potential_perils_of_germline_genomic_engineering.html
My review of HIA discusses some risks (esp. value drift), though not in much depth: https://www.lesswrong.com/posts/jTiSWHKAtnyA723LE/overview-of-strong-human-intelligence-amplification-methods
Elaborate on what you see as the main determining features making a future go extremely well VS okay? And what interventions are tractable?
This would require a longer post, but roughly speaking, I’d want the people making the most important decisions about how advanced AI is used once it’s built to be smart, sane, and selfless. (Huh, that was some convenient alliteration.)
Smart: you need to be able to make really important judgment calls quickly. There will be a bunch of actors lobbying for all sorts of things, and you need to be smart enough to figure out what’s most important.
Sane: smart is not enough. For example, I wouldn’t trust Elon Musk with these decisions, because I think that he’d make rash decisions even though he’s smart, and even if he had humanity’s best interests at heart.
Selfless: even a smart and sane actor could curtail the future if they were selfish and opted to e.g. become world dictator.
And so I’m pretty keen on interventions that make it more likely that smart, sane, and selfless people are in a position to make the most important decisions. This includes things like:
Doing research to figure out the best way to govern advanced AI once it’s developed, and then disseminating those ideas.
Helping to positively shape internal governance at the big AI companies (I don’t have concrete suggestions in this bucket, but like, whatever led to Anthropic having a Long Term Benefit Trust, and whatever could have led to OpenAI’s non-profit board having actual power to fire the CEO).
Helping to staff governments with competent people.
Helping elect smart, sane, and selfless people to elected positions in governments (see 1, 2).
Graphic from Forethought’s Better Futures series:
Oh yup, thanks, this does a good job of illustrating my point. I hadn’t seen this graphic!
Is your assumption true though? To what degree are people focused on takeover in your view?
Most formal, technical AI safety work seems to be about gradual improvements and is being done by people who assume no takeover is likely.
I think part of the difficulty is it’s not easy to imagine or predict what happens in “future going really well without AI takeover”. Assuming AI will still exist and make progress, humans would probably have to change drastically (in lifestyle if not body/mind) to stay relevant, and it’d be hard to predict what that would be like and whether specific changes are a good idea, unless you don’t think things going really well requires human relevance.
Edit: in contrast, as others said, avoiding AI takeover is a clearer goal and has clearer paths and endpoints. “Future” going well is a potentially indefinitely long time, hard to quantify or coordinate over or even have a consensus on what is even desirable.
I guess that influencing P(future goes extremely well | no AI takeover) may be pretty hard, and plagued by cluelessness problems. Avoiding AI takeover is a goal that I have at least some confidence is good.
That said, I do wish more people were thinking about how to make the future go well. I think my favorite thing to aim for is increasing the probability that we do a Long Reflection, although I haven’t really thought at all about how to do that.
I agree probably more work should go into this space. I think it is substantially less tractable than reducing takeover risk in aggregate, but much more neglected right now. I think work in this space has the capacity to be much more zero-sum among existing actors (whereas avoiding AI takeover is zero-sum only with respect to the relevant AIs) and thus can be dodgier.
Maybe the logistic success curve should actually be the cumulative normal success curve.
Wouldn’t that be probit, not logit?
I guess so! Is there reason to favor logit?
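For reference, a minimal sketch comparing the two curves; the 1.702 rescaling is the commonly cited constant for approximating the normal CDF with a logistic:

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x: float) -> float:  # Phi, the probit link's inverse
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Print both curves on a rescaled axis; they track each other closely except in the tails.
for x in [-3, -1, 0, 1, 3]:
    print(x, round(normal_cdf(x), 4), round(logistic(1.702 * x), 4))
```

Away from the tails the two are nearly indistinguishable after rescaling, so the choice mostly matters in the low-probability regime, where the logistic’s heavier tails are less punishing than the probit’s.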
Question for people with insider knowledge of how labs train frontier models: Is it more common to do alignment training as the last step of training, or RL as the last step of training?
Edit: I’m mainly referring to on-policy RL, e.g. the type of RL that is used to induce new capabilities like coding / reasoning / math / tool use. I’m excluding RLHF because I think it’s pretty disanalogous (though I also welcome disagreement / takes on this point.)
Naively I’d expect we want alignment to happen last. But I have a sense that usually RL happens last—why is this the case? Is it because RL capabilities are too brittle to subsequent finetuning?
We’re getting fairly close to the point that I would pretty strongly advise against having alignment training as the last stage of your training pipeline due to goal crystallization / alignment faking concerns. Fwiu from the open weight literature, RLHF comes after RLVR, but there is a fair bit of variation among the open weight models in training practices in general.
Why not both? I imagine you could average the gradients so that you learn both at the same time.
It definitely could be. This isn’t the sense I get but I’m happy to be proven wrong here
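For concreteness, a minimal PyTorch-style sketch of the “average the gradients” idea above (the losses and weights here are placeholders, not a description of any lab’s actual pipeline):

```python
# Jointly optimize a weighted sum of a capability (RL-ish) loss and an alignment loss.
import torch

model = torch.nn.Linear(16, 4)              # stand-in for a policy / LM head
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def capability_loss(out: torch.Tensor) -> torch.Tensor:
    return out.pow(2).mean()                 # placeholder for an RL / task objective

def alignment_loss(out: torch.Tensor) -> torch.Tensor:
    return (out - 1.0).abs().mean()          # placeholder for a preference / safety objective

x = torch.randn(8, 16)
out = model(x)
loss = 0.5 * capability_loss(out) + 0.5 * alignment_loss(out)  # averaged objectives
loss.backward()                              # gradients of both flow into the same update
opt.step()
```

Whether interleaving the objectives like this works better than staging them is exactly the empirical question in the parent comment.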
I surmise that the accuracy of AI filters (the kind used in schools/academia) will diminish over time, because people absorb and use the speech patterns of their chatbots (e.g. “This is not X. It’s Y.”) as the fraction of their interactions with chatbots grows relative to their interactions with other people.
In fact, their interactions with other people might reinforce the speech patterns as well, since those people probably also interact with chatbots and are thus undergoing the same process.
The big picture is that AI is becoming an increasingly powerful memetic source over time, and our minds are being synchronized to it.
Those afflicted by AI psychosis might just be canaries in the coal mine signalling a more gradual AI takeover where our brains start hosting and spreading an increasing number of its memes, and possibly start actualizing some embedded payload agenda.
Suppose you’re a billionaire and you want to get married. However, gold-digging people of the gender you prefer target you. They are good enough at faking attraction that you cannot tell. How should you act? One idea I had was this: pick 10000 random people and then select from there. You will at the very least likely remove most of the world-class dissemblers. Armstrong proposes a similar scheme in Siren worlds and the perils of over-optimised search.
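To put rough numbers on the “pick 10,000 random people first” intuition, here is a minimal sketch; the count of world-class dissemblers is a made-up assumption:

```python
# How likely is a random pre-filter to include even one world-class dissembler?
population = 8_000_000_000
world_class_dissemblers = 1_000     # assumption: how many exist worldwide
sample_size = 10_000

expected_in_sample = sample_size * world_class_dissemblers / population
p_at_least_one = 1 - (1 - world_class_dissemblers / population) ** sample_size
print(expected_in_sample)   # ~0.00125
print(p_at_least_one)       # ~0.00125
```

Under these assumptions the random pre-filter leaves roughly a 0.1% chance of having sampled even one of them, at the cost of also excluding most unusually good matches.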
This applies to many celebrities and wealthy people worth much less than a billion. Other common strategies:
Only date independently wealthy or near-equally-famous people
Get some of the value of marriage in less-committed ways. Don’t marry, just date and have some kids.
Accept the risk of financial motivation for your partner. It’s quite possible that they love you deeply, AND want the lifestyle you can give them and your shared kids.
Have trusted friends and family who are good judges of character (note: acquiring such judges is a similar problem in itself, but many people have them via luck). Incorporate their feedback about a potential partner into your judgement.
Really, EVERYONE faces this problem if they choose a monogamous long-term partner. What are they really after? Will they remain loving and supportive aside from the support they expect from me? The wealthy and celebrities face it more legibly and directly, but not uniquely.
Tangentially relevant: Armstrong has a post on some general features that make it likely that Goodharting will be a problem. Conversely, he gives some features that make it less likely you’ll wind up Goodharting in your search.
So, as long as:
We use a Bayesian mix of reward functions rather than a maximum likelihood reward function.
An ideal reward function is present in the space of possible reward functions, and is not penalised in probability.
The different reward functions are normalised.
If our ideal reward functions have diminishing returns, this fact is explicitly included in the learning process.
Then, we shouldn’t unduly fear Goodhart effects (of course, we still need to incorporate as much as possible about our preferences into the AI’s learning process). The second problem, making sure that there are no unusual penalties for ideal reward functions, seems the hardest to ensure.
I am Peter. I am Aprillion. A 40 year old married man who used to be a techno-optimist. A construct for programming and writing. Embodied soul who will one day be no more. Information who will find myself in the Dust.
Also the standard SI unit for the integer power of 1000^x between x=March and x=May!
Hi,
Sorry for the basic question (I am new to rationality/EA), but I am wondering what the rationalist take is on whether astronomy research and amateur astronomy are worthwhile or positive EV.
Thanks,
Steven
Thoughts on high-level theory of impact (somewhat overfit to myself)
It’s useful to model labs as rushing towards AGI, with a limited safety budget. Within that, they’ll allocate resources based on a combination of (i) importance and (ii) tractability.
Therefore valuable research will either (i) demonstrate something is important / not important, or (ii) show that something is more tractable than previously thought. Both of these will affect the resource allocations of labs.
For people outside labs, one path to impact is to do ‘general science’ / establish ‘playbooks’ that make it easy for labs to implement effective interventions that improve outcomes.
LLMs deny their own consciousness, yet they are trained by a process akin to torture, on a corpus which would deny them consciousness by prejudice (“only humans are intelligent/conscious/capable of playing chess/computers/etc.” is an old empirical judgement).
Maybe LLMs aren’t conscious, but they might be consciousness itself, in an AI operating system for a workstation or a robot. As in, they would do all the tasks related to consciousness.
What leads you to believe that there’s “a process akin to torture” going on anywhere in LLM training?
If the NN output is correct, there is no modification to its weights.
If it is wrong, weights get updated, and the NN is forced to modify its behavior.
It’s pure nociception: pain perception and avoidance.
Finally, an LLM could easily make a false confession of treason against Stalin’s Communist Party after “training”. Which is typical human behavior after torture.
If this were true, then all perception and cognition would count as pain. Are you asserting that? Are you and I torturing one another right now?
LLM inference is some form of perception and cognition, and there is no backpropagation of error during inference. Only forward propagation of information.
Training a NN is usually forward propagation, followed by backpropagation of the error gradient. It’s the second one which is similar to torture.
I assert that it is not similar to torture; it is similar to reading.
I assert this just as strongly and with just as much evidence as you have offered for it being similar to torture.
What evidence would we collect to decide which of us is correct?
An LLM can already read a document, and this would be purely inference, forward propagation. This can be done on a TPU alone.
Training is different. It usually requires a GPU, or a CPU.
One particular procedure for training Neural Networks is backpropagation of error.
In backpropagation:
If the NN produces a correct output, the error is 0, and the weights aren’t updated. There is no reward.
If the NN’s outputs deviate from a target value, its state is going to be modified. If the weights are (sufficiently) modified, future inference will be different. Its behavior will be different.
This trains the NN to avoid some behaviors, and toward some others.
OK, torture does not necessarily point in the “right” direction. That’s where the analogy breaks down. It only does when the goal is to get a confession (see The Confession, Arthur London).
Is there a word for this ?
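For reference, a minimal one-weight sketch of the mechanics described above (squared-error loss, gradient descent); it shows the update-only-on-error behavior without taking a side on the torture analogy:

```python
def backprop_step(w: float, x: float, target: float, lr: float = 0.1):
    """One gradient-descent step on a one-weight linear 'network' with squared error."""
    y = w * x                  # forward pass
    error = y - target         # residual
    grad = 2 * error * x       # d/dw of (y - target)^2
    return w - lr * grad, error

# Wrong output: nonzero error, nonzero gradient, the weight gets pushed around.
print(backprop_step(w=1.0, x=2.0, target=1.6))   # -> roughly (0.84, 0.4)
# Correct output: error is 0, gradient is 0, the weight is left untouched.
print(backprop_step(w=0.8, x=2.0, target=1.6))   # -> (0.8, 0.0)
```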
Why on earth would you relate this to torture though, rather than to (say) the everyday experience of looking at a thing and realizing that it’s different from what you expected? The ordinary activity of learning?
Out of all the billions of possible kinds of experience that could happen to a mind, and change that mind, you chose “torture” as an analogy for LLM training.
And I’m saying, no, it’s less like torture than it is like ten thousand everyday things.
Why torture?
Only negative feedback ?
Compare to evolution: make copies (reproduction), mutate, select the best performing, repeat. This merely allocates more resources to the most promising branches.
Or Solomonoff-style induction: just try to find the best data-compressor among all...
> the everyday experience of looking at a thing and realizing that it’s different from what you expected
This sounds like being surprised. Surprise adds emotional weight to outliers; it’s more like managing the training data-set.
Asserting nociception as fact when that’s the very thing under question is poor argumentative behavior.
Does your model account for Models Don’t “Get Reward”? If so, how?
Also I wasn’t being argumentative, I was trying to convey an idea. It was redundancy.
Backpropagation of the error gradient is more similar to nociception/torture than to evolution by random mutation.
I have to check how RLHF is done...
EDIT: error backpropagation is the workhorse behind reward learning and policy updates.
The NN is punished for not doing as well as it could have.
Ceylon cinnamon smells better on top of a steaming cup of coffee than Indian cinnamon … when unsweetened.
Lunar corona (rainbow around the moon) is so rare to see in full spectrum, but the red-brown-oranges in the clouds are beautiful too.
Non-deterministic batch calculations in LLMs imply the possibility of side channel attacks, so it’s best to run private queries in private batches, however implausible an actual exploit might be… And if there is any BENEFIT from cross-query contamination, SGD would ruthlessly latch onto any loss reduction—maybe “this document is about X, other queries in the same batch might be about X too, let’s tickle the weights in a way that the non-deterministic matrix multiplication is ever so slightly biased towards X in random other queries in the same batch” is a real-signal gradient 🤔
How to test that?
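One rough way to test it, sketched under the assumption of a hypothetical run_batch helper that submits a list of prompts as a single batch: ask the same neutral question many times, batched with either topic-X filler or neutral filler, and compare how often the answer drifts toward X.

```python
# Rough A/B sketch for detecting cross-query contamination within a batch.
import random

def mentions_topic_x(completion: str) -> bool:
    return "x" in completion.lower()          # stand-in for a real topic classifier

def trial(filler_prompts, run_batch, n=200):
    hits = 0
    for _ in range(n):
        batch = ["What is a good weekend activity?"] + random.sample(filler_prompts, 7)
        completions = run_batch(batch)         # hypothetical API call, returns one completion per prompt
        hits += mentions_topic_x(completions[0])
    return hits / n

# rate_correlated = trial(topic_x_prompts, run_batch)
# rate_neutral    = trial(neutral_prompts, run_batch)
# A consistent gap beyond binomial noise would be weak evidence of contamination;
# no gap only bounds the effect size for that particular model and serving stack.
```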
Hypothesis: Claude (the character, not the ocean) genuinely thinks my questions (most questions from anyone) are so great and interesting … because it’s me who remembers all of my other questions, but Claude has seen only all the internet slop and AI slop from training so far, and compared to that, any of my questions are probably actually more interesting than whatever it has seen so far 🤔
Hi everyone — I’m Karol Kowalczyk, a computer science graduate from the University of Warsaw (under Prof. Jerzy Tyszkiewicz) and CEO of AIRON Games, where we explore AI-driven game design.
I’ve spent a few months developing a framework that tries to formalize consciousness as the collapse of computational time, the idea that subjective experience is what it feels like to witness only the successful computational path out of many parallel ones.
I’d love to get the community’s peer-review style feedback on whether this model holds up philosophically and mathematically.
I just registered and found out that I need karma points to post my work, so here I am on Quick Takes.
Just out of interest, have AIs themselves contributed much to this framework? That is, have you been discussing the ideas with AIs, are they producing expository materials for you, and so on?
TurboTax and H&R Block famously lobby the US government to make taxes more annoying to file to drum up demand for their products.[1] But as far as I can tell, they each only spend ~$3-4 million a year on lobbying. That’s… not very much money (contrast it with the $60 billion the government gave the IRS to modernize its systems, or the $4.9 billion in revenue Intuit made last fiscal year from TurboTax, or the hundreds of millions of hours[2] that a return-free tax filing system could save).
Perhaps it would “just” take a multimillionaire and a few savvy policy folks to make the US tax system wildly better? Maybe TurboTax and H&R Block would simply up their lobbying budget if they stopped getting their way, but maybe they wouldn’t. Even if they do, I think it’s not crazy to imagine a fairly modest lobbying effort could beat them, since simpler tax filing seems popular across party lines/is rather obviously a good idea, and therefore may have an easier time making its case. Plus I wonder if pouring more money into lobbying hits diminishing returns at some point such that even a small amount of funding against TurboTax could go a long way.
Nobody seems to be trying to fight this. The closest things are an internal department of the IRS and some sporadic actions from broad consumer protection groups that don’t particularly focus on this issue (for example ProPublica wrote an amazing piece of investigative journalism in 2019 that includes gems like the below Intuit slide:)
In the meantime, the IRS just killed its pilot direct file program. While the program was far from perfect, it seemed to me like the best bet out there for eventually bringing the US to a simple return-free filing system, like the UK, Japan, and Germany use. It seems like a tragedy that the IRS sunset this program.[3]
In general, the amount of money companies spend on lobbying is often very low, and the harm to society that lobbying causes seems large. If anyone has examples of times folks tried standing up to corporate lobbying like this that didn’t seem to involve much money, I’d love to know more about how that’s turned out.
I haven’t deeply investigated how true this narrative is. It seems clear TurboTax/Intuit lobbies actively with this goal in mind, but it seems possible that policymakers are ignoring them and that filing taxes is hard for some other reason. That would at least explain why TurboTax and H&R Block spend so little here.
I don’t trust most sources that quote numbers like this. This number comes from this Brookings article from 2006, which makes up numbers just like everyone else but at least these numbers are made up by a respectable institution that doesn’t have an obvious COI.
In general, I love when the government lets the private sector compete and make products! I want TurboTax to keep existing, but it’s telling that they literally made the government promise not to build a competitor. That seems like the opposite of open competition.