sunwillrise comments on The Field of AI Alignment: A Postmortem, and What To Do About It

sunwillrise 26 Dec 2024 20:30 UTC
71 points
15
(Prefatory disclaimer that, admittedly as an outsider to this field, I absolutely disagree with the labeling of prosaic AI work as useless streetlighting, for reasons building upon what many commenters wrote in response to the very posts you linked here as assumed background material. But in the spirit of your post, I shall ignore that moving forward.)
The “What to Do About It” section dances around but doesn’t explicitly name one of the core challenges of theoretical agent-foundations work that aims to solve the “hard bits” of the alignment challenge, namely the seeming lack of reliable feedback loops that give you some indication that you are pushing towards something practically useful in the end instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium. As Conor Leahy concisely put it:
Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.
He was talking about philosophy in particular at that juncture, in response to Wei Dai’s concerns over metaphilosophical competence, but this point seems to me to generalize to a whole bunch of other areas as well. Indeed, I have talked about this before.
… and in my experience, there are people who can get traction on the core hard problems. Most notably physicists—when they grok the hard parts, they tend to immediately see footholds, rather than a blank impassable wall.
Do they get traction on “core hard problems” because of how Inherently Awesome they are as researchers, or do they do so because the types of physics problems we mostly care about currently are such that, while the generation of (worthwhile) grand mathematical theories is hard, verifying them is (comparatively) easier because we can run a bunch of experiments (or observe astronomical data etc., in the super-macro scale) to see if the answers they spit out comply with reality? I am aware of your general perspective on this matter, but I just… still completely disagree, for reasons other people have pointed out (see also Vanessa Kosoy’s comment here). Is this also supposed to be an implicitly assumed bit of background material?
And when we don’t have those verifying experiments at hand, do we not get stuff like string theory, where the math is beautiful and exquisite (in the domains it has been extended do) but debate by “physics postdocs” over whether it’s worthwhile to keep funding and pursuing it keeps raging on as a Theory of Everything keeps eliding our grasp? I’m sure people with more object-level expertise on this can correct my potential misconceptions if need be.
Idk man, some days I’m half-tempted to believe that all non-prosaic alignment work is a bunch of “streetlighting.” Yeah, it doesn’t result in the kind of flashy papers full of concrete examples about current models that typically get associated with the term-in-scare-quotes. But it sure seems to cover itself in a veneer of respectability by giving a (to me) entirely unjustified appearance of rigor and mathematical precision and robustness to claims about what will happen in the real world based on nothing more than a bunch of vibing about toy models that assume away the burdensome real-world details serving as evidence whether the approaches are even on the right track. A bunch of models that seem both woefully underpowered for the Wicked Problems they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first patch of really powerful AIs. The contents and success stories of Vanessa Kosoy’s desiderata, or of your own search for natural abstractions, or of Alex Altair’s essence of agent foundations, or of Orthogonal’s QACI, etc., seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind almost surely disqualifies them as useful-in-the-real-world when the endgame hits.
There’s a famous Eliezer quote about how for every correct answer to a precisely-stated problem, there are a million times more wrong answers one could have given instead. I would build on that to say that for every powerfully predictive, but lossy and reductive mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don’t generalize well at all. And it’s only by grounding yourself to reality and hugging the query tight by engaging with real-world empirics that you can figure out if the approach you’ve chosen is in the former category as opposed to the latter.
(I’m briefly noting that I don’t fully endorse everything I said in the previous 2 paragraphs, and I realize that my framing is at least a bit confrontational and unfair. Separately, I acknowledge the existence of arguably-non-prosaic and mostly theoretical alignment approaches like davidad’s Open Agency Architecture, CHAI’s CIRL and utility uncertainty, Steve Byrnes’s work on brain-like AGI safety, etc., that don’t necessarily appear to fit this mold. I have varying opinions on the usefulness and viability of such approaches.)
What links here?
- sunwillrise's comment on Agent foundations: not really math, not really science by Alex_Altair (21 Aug 2025 12:26 UTC; 5 points)
- Noosphere89 27 Dec 2024 0:38 UTC
  15 points
  5
  Parent
  I actually disagree with the natural abstractions research being ungrounded. Indeed, I think there is reason to believe that at least some of the natural abstractions work, especially the natural abstraction hypothesis actually sorts of holds true for today’s AI, and thus is the most likely out of the theoretical/agent-foundation approaches to work (I’m usually critical to agent foundations, but John Wentworth’s work is an exception that I’d like funding for).
  
  For example, this post does an experiment that shows that OOD data still makes the Platonic Representation Hypothesis true, meaning that it’s likely that deeper factors are at play than just shallow similarity:
  
  https://www.lesswrong.com/posts/Su2pg7iwBM55yjQdt/exploring-the-platonic-representation-hypothesis-beyond-in
  - sunwillrise 27 Dec 2024 9:21 UTC
    16 points
    0
    Parent
    I’m wary of a possible equivocation about what the “natural abstraction hypothesis” means here.
    If we are referring to the redundant information hypothesis and various kinds of selection theorems, this is a mathematical framework that could end up being correct, is not at all ungrounded, and Wentworth sure seems like the man for the job.
    But then you are still left with the task of grounding this framework in physical reality to allow you to make correct empirical predictions about and real-world interventions on what you will see from more advanced models. Our physical world abstracting well seems plausible (not necessarily >50% likely), and these abstractions being “natural” (e.g., in a category-theoretic sense) seems likely conditional on the first clause of this sentence being true, but I give an extremely low probability to the idea that these abstractions will be used by any given general intelligence or (more to the point) advanced AI model to a large and wide enough extent that retargeting the search is even close to possible.
    And indeed, it is the latter question that represents the make-or-break moment for natural abstractions’ theory of change, for it is only when the model in front of you (as opposed to some other idealized model) uses these specific abstractions that you can look through the AI’s internal concepts and find your desired alignment target.
    Rohin Shah has already explained the basic reasons why I believe the mesa-optimizer-type search probably won’t exist/be findable in the inner workings of the models we encounter: “Search is computationally inefficient relative to heuristics, and we’ll be selecting really hard on computational efficiency.” And indeed, when I look at the only general intelligences I have ever encountered in my entire existence thus far, namely humans, I see mostly just a kludge of impulses and heuristics that depend very strongly (almost entirely) on our specific architectural make-up and the contextual feedback we encounter in our path through life. Change either of those and the end result shifts massively.
    And even moving beyond that, is the concept of the number “three” a natural abstraction? Then I see entire collections and societies of (generally intelligent) human beings today who don’t adopt it. Are the notions of “pressure” and “temperature” and “entropy” natural abstractions? I look at all human beings in 1600 and note that not a single one of them had ever correctly conceptualized a formal version of any of those; and indeed, even making a conservative estimate of the human species (with an essentially unchanged modern cognitive architecture) having existed for 200k years, this means that for 99.8% of our species’ history, we had no understanding whatsoever of concepts as “universal” and “natural” as that. If you look at subatomic particles like electrons or stuff in quantum mechanics, the percentage manages to get even higher. And that’s only conditioning on abstractions about the outside world that we have eventually managed to figure out; what about the other unknown unknowns?
    For example, this post does an experiment that shows that OOD data still makes the Platonic Representation Hypothesis true, meaning that it’s likely that deeper factors are at play than just shallow similarity
    I don’t think it shows that at all, since I have not been able to find any analysis of the methodology, data generation, discussion of results, etc. With no disrespect to the author (who surely wasn’t intending for his post to be taken as authoritative as a full paper in terms of updating towards his claim), this is shoddy science, or rather not science at all, just a context-free correlation matrix.
    Anyway, all this is probably more fit for a longer discussion at some point.
    - Thane Ruthenis 27 Dec 2024 16:54 UTC
      9 points
      0
      Parent
      Rohin Shah has already explained the basic reasons why I believe the mesa-optimizer-type search probably won’t exist/be findable in the inner workings of the models we encounter: “Search is computationally inefficient relative to heuristics, and we’ll be selecting really hard on computational efficiency.”
      I think this statement is quite ironic in retrospect, given how OpenAI’s o-series seems to work (at train-time and at inference-time both), and how much AI researchers hype it up.
      By contrast, my understanding is that the sort of search John is talking about retargeting isn’t the brute-force babble-and-prune algorithms, but a top-down heuristical-constraint-based search.
      So it is in fact the ML researchers now who believe in the superiority of the computationally inefficient search; not the agency theorists.
      - Noosphere89 27 Dec 2024 17:27 UTC
        13 points
        4
        Parent
        Re the OpenAI o-series and search, my initial prediction is that Q*/MCTS search will work well on problems that are easy to verify and and easy to get training data for, and not work if either of these 2 conditions are violated, and secondarily will be reliant on the model having good error correction capabilities to use the search effectively, which is why I expect we can make RL capable of superhuman performance on mathematics/programming with some rather moderate schlep/drudge work, and I also expect cost reductions such that it can actually be practical, but I’m only giving a ⁵⁰⁄₅₀ chance by 2028 for superhuman performance as measured by benchmarks in these domains.
        I think my main difference from you, Thane Ruthenis is I expect costs to reduce surprisingly rapidly, though this is admittedly untested.
        This will accelerate AI progress, but not immediately cause an AI explosion, though in the more extreme paces this could create something like a scenario where programming companies are founded by a few people smartly managing a lot of programming AIs, and programming/mathematics experiencing something like what happened to the news industry from the rise of the internet, where there was a lot of bankruptcy of the middle end, the top end won big, and most people are in the bottom end.
        Also, correct point on how a lot of people’s conceptions of search are babble-and-prune, not top down search like MCTS/Q*/BFS/DFS/A* (not specifically targeted at sunwillrise here).
        By contrast, my understanding is that the sort of search John is talking about retargeting isn’t the brute-force babble-and-prune algorithms, but a top-down heuristical-constraint-based search.
        
        Thane Ruthenis 27 Dec 2024 17:53 UTC
        9 points
        0
        Parent
        I’m not strongly committed to the view that the costs won’t rapidly reduce: I can certainly see the worlds in which it’s possible to efficiently distill trees-of-thought unrolls into single chains of thoughts. Perhaps it scales iteratively, where we train a ML model to handle the next layer of complexity by generating big ToTs, distilling them into CoTs, then generating the next layer of ToTs using these more-competent CoTs, etc.
        Or perhaps distillation doesn’t work that well, and the training/inference costs grow exponentially (combinatorially?).
        Noosphere89 27 Dec 2024 18:07 UTC
        2 points
        2
        Parent
        Yeah, we will have to wait at least several years.
        
        One confound in all of this is that big talent is moving out of OpenAI, which means I’m more bearish on the company’s future prospects specifically without it being that much of a detriment towards progress towards AGI.
      - Rohin Shah 29 Dec 2024 16:17 UTC
        11 points
        2
        Parent
        I think this statement is quite ironic in retrospect, given how OpenAI’s o-series seems to work
        I stand by my statement and don’t think anything about the o-series model invalidates it.
        And to be clear, I’ve expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I’m thinking of, I made the critiques privately.)
        There’s a world of difference between “you can get better results by thinking longer” (yeah, obviously this was going to happen) and “the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search” (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).
        Thane Ruthenis 29 Dec 2024 16:42 UTC
        4 points
        0
        Parent
        To lay out my arguments properly:
        “Search is ruinously computationally inefficient” does not work as a counter-argument against the retargetability of search, because the inefficiency argument applies to babble-and-prune search, not to the top-down heuristical-constraint-based search that was/is being discussed.
        There are valid arguments against easily-retargetable heuristics-based search as well (I do expect many learned ML algorithms to be much messier than that). But this isn’t one of them.
        ML researchers are currently incredibly excited about the inference-time scaling laws, talking about inference runs costing millions/billions of dollars, and how much capability will be unlocked this way.
        The o-series paradigm would use this compute to, essentially, perform babble-and-prune search. The pruning would seem to be done by some easily-swappable evaluator (either the system’s own judgement based on the target specified in a prompt, or an external theorem-prover, etc.).
        If things will indeed go this way, then it would seem that a massive amount of capabilities will be based on highly inefficient babble-and-prune search, and that this search would be easily retargetable by intervening on one compact element of the system (the prompt, or the evaluator function).
        Rohin Shah 29 Dec 2024 17:05 UTC
        8 points
        2
        Parent
        Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just “computational inefficiency”.
        Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn’t new to the o-series, you can also “retarget” any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as “it turns out alignment wasn’t a problem” (e.g. scheming never arose, we never had problems with LLMs exploiting systematic mistakes in our supervision, etc). I’d summarize this as “x-risky misalignment just doesn’t happen by default”, which I agree is plausible (see e.g. here), but when I’m talking about the viability of alignment plans like “retarget the search” I generally am assuming that there is some problem to solve.
        (Also, random nitpick, who is talking about inference runs of billions of dollars???)
        Thane Ruthenis 29 Dec 2024 17:39 UTC
        4 points
        0
        Parent
        Yup, I read through it after writing the previous response and now see that you don’t need to be convinced of that point. Sorry about dragging you into this.
        I could nitpick the details here, but I think the discussion has kind of wandered away from any pivotal points of disagreement, plus John didn’t want object-level arguments under this post. So I petition to leave it at that.
        Also, random nitpick, who is talking about inference runs of billions of dollars???
        There’s a log-scaling curve, OpenAI have already spent on the order of a million dollars just to score well on some benchmarks, and people are talking about “how much would you be willing to pay for the proof of the Riemann Hypothesis?”. It seems like a straightforward conclusion that if o-series/inference-time scaling works as well as ML researchers seem to hope, there’d be billion-dollar inference runs funded by some major institutions.
        Rohin Shah 29 Dec 2024 19:07 UTC
        6 points
        2
        Parent
        OpenAI have already spent on the order of a million dollars just to score well on some benchmarks
        Note this is many different inference runs each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn’t specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.
    - Noosphere89 27 Dec 2024 15:15 UTC
      2 points
      0
      Parent
      Yeah, it hasn’t been shown that these abstractions can ultimately be retargeted by default for today’s AI.
- Jozdien 27 Dec 2024 13:05 UTC
  6 points
  2
  Parent
  The contents and success stories of Vanessa Kosoy’s desiderata, or of your own search for natural abstractions, or of Alex Altair’s essence of agent foundations, or of Orthogonal’s QACI, etc., seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind almost surely disqualifies them as useful-in-the-real-world when the endgame hits.
  (I haven’t read your comments you link, so apologies if you’ve already responded to this point before).
  I can’t speak to most of these simply out of lack of deep familiarity, but I don’t think natural abstractions is disqualified at all by this.
  What do we actually want out of interpretability? I don’t think mechanistic interpretability, as it stands currently, gives us explanations of the form we actually want. For example, what are a model’s goals? Is it being deceptive? To get answers to those questions, you want to first know what those properties actually look like—you can’t get away with identifying activations corresponding to how to deceive humans, because those could relate to a great number of things (e.g. modelling other deceptive agents). Composability is a very non-trivial problem.
  If you want to answer those questions, you need to find a way to get better measures of whatever property you want to understand. This is the central idea behind Eliciting Latent Knowledge and other work that aims for unsupervised honesty (where the property is honesty), what I call high-level interpretability of inner search / objectives, etc.
  Natural abstractions is more agnostic about what kinds of properties we would care about, and tries to identify universal building blocks for any high-level property like this. I am much more optimistic about picking a property and going with it, and I think this makes the problem easier, but that seems like a different disagreement than yours considering both are inevitably somewhat conceptual and require more prescriptive work than work focusing solely on frontier models.
  If you wanted to get good handles to steer your model at all, you’re going to have to do something like figuring out the nature of the properties you care about. You can definitely make that probem easier by focusing on how those properties instantiate in specific classes of systems like LLMs or neural nets (and I do in my work), but you still have to deal with a similar version of the problem in the end. John is sceptical enough of this paradigm being the one that leads us to AGI that he doesn’t want to bet heavily on his work only being relevant if that turns out to be true, which I think is pretty reasonable.
  (These next few sentences aren’t targeted at you in particular). I often see claims made of the form: “[any work that doesn’t look like working directly with LLMs] hasn’t updated on the fact that LLMs happened”. Sometimes that’s true! But very commonly I also see the claim made without understanding what that work is actually trying to do, or what kind of work we would need to reliably align / interpret super-intelligent LLMs-with-RL. I don’t know whether it’s true of the other agent foundations work you link to, but I definitely don’t think natural abstractions hasn’t updated on LLMs being the current paradigm.
  Do they get traction on “core hard problems” because of how Inherently Awesome they are as researchers, or do they do so because the types of physics problems we mostly care about currently are such that, while the generation of (worthwhile) grand mathematical theories is hard, verifying them is (comparatively) easier because we can run a bunch of experiments (or observe astronomical data etc., in the super-macro scale) to see if the answers they spit out comply with reality? I am aware of your general perspective on this matter, but I just… still completely disagree, for reasons other people have pointed out (see also Vanessa Kosoy’s comment here). Is this also supposed to be an implicitly assumed bit of background material?
  Agreed that this is a plausible explanation of what’s going on. I think that the bottleneck on working on good directions in alignment is different though, so I don’t think the analogy carries over very well. I think that reliable feedback loops are very important in alignment research as well to be clear, I just don’t think the connection to physicists routes through that.
- mesaoptimizer 27 Dec 2024 13:45 UTC
  4 points
  1
  Parent
  Even if I’d agree with your conclusion, your argument seems quite incorrect to me.
  
  the seeming lack of reliable feedback loops that give you some indication that you are pushing towards something practically useful in the end instead of just a bunch of cool math that nonetheless resides alone in its separate magisterium
  
  That’s what math always is. The applicability of any math depends on how well the mathematical models reflect the situation involved.
  
  would build on that to say that for every powerfully predictive, but lossy and reductive mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don’t generalize well at all. And it’s only by grounding yourself to reality and hugging the query tight by engaging with real-world empirics that you can figure out if the approach you’ve chosen is in the former category as opposed to the latter.
  
  It seems very unlikely to me that you’d have many ‘similar-looking mathematical models’. If a class of real-world situations seems to be abstracted in multiple ways such that you have hundreds (not even millions) of mathematical models that supposedly could capture its essence, maybe you are making a mistake somewhere in your modelling. Abstract away the variations. From my experience, you may have a small bunch of mathematical models that could likely capture the essence of the class of real-world situations, and you may debate with your friends about which one is more appropriate, but you will not have ‘multiple similar-looking models’.
  
  Nevertheless, I agree with your general sentiment. I feel like humans will find it quite difficult make research progress without concrete feedback loops, and that actually trying stuff with existing examples of models (that is, the stuff that Anthropic and Apollo are doing, for example) provide valuable data points.
  
  I also recommend maybe not spending so much time reading LessWrong and instead reading STEM textbooks.
- TristanTrim 13 Aug 2025 0:38 UTC
  3 points
  0
  Parent
  
  seem entirely insensitive to the fact that we are currently dealing with multimodal LLMs combined with RL instead of some other paradigm, which in my mind almost surely disqualifies them as useful-in-the-real-world when the endgame hits.
  
  I am unfortunately reminded of this song. “They all try to keep up while we f*** this shit up”. Yeah. Most people focusing on the hard problems from what I can tell are hoping we will get lucky and there will be some breakthrough or we will pause for time for their slow progress to actually solve the problem. Yeah, getting math and models correct is hard but without math and models we only have wishful thinking, which is worse.