Co-Director (Research) at PIBBSS
Previously: Applied epistemologist and Research Engineer at Conjecture.
Lucas Teixeira
I’m curious if you have a sense of:
1. What the target goal of early crunch-time research should be (e.g. a control safety case for the specific model one has at the present moment, a trustworthy case for that specific model, a trustworthy safety case for that specific model plus a deference case for future models, a trustworthy safety case for all future models, etc.)
2. The rough shape(s) of that case (e.g. white-box evaluations, control guardrails, convergence guarantees, etc.)
3. What kinds of evidence you expect to accumulate given access to these early powerful models.
I expect I disagree with the view presented, but without clarification on the points above I’m not certain. I also expect my cruxes would route through these points.
Is this vibes or was there some kind of study done?
I think I get it now. I was confused about how, under your model, we would continue to generate sequences of thoughts with thematic consistency, e.g. [thinking about cake → planning to buy a cake → buying a cake → eating a cake], as opposed to ones which aren’t thematically consistent, e.g. [thinking about cake → taking a nap → calling your friend → scratching an itch].
Two things are apparent to me now:
1. Valence is conditional on current needs.
2. Latent states of the thought generator are themselves inputs to the next thoughts being generated.
I expect I’m still confused about valence, but may ask follow-up questions on another thread in a more relevant post. Thanks for the reply!
Nit: The title gives the impression of a demonstrated result as opposed to a working hypothesis and proposed experiment.
I’m curious how/whether goal coherence over long-term plans is explained by your “planning as reward shaping” model. If planning amounts to an escalation of more and more “real thoughts” (e.g. I’m idly thinking about prinsesstårta → a fork-full of prinsesstårta is heading towards my mouth), because these correspond to stronger activations in a valenced latent in my world model, and my thought generator is biased towards producing higher-valence thoughts, it’s unclear to me why we wouldn’t just default to the production of untopical thoughts (e.g. I’m idly thinking about prinsesstårta → I’m thinking about being underneath a weighted blanket) and never get anything done in the world.
One reply would be to bite the bullet and say that yes, humans do in fact have deficits in their long-term planning strategies and this accounts for them, but that feels unsatisfying; if the story given in my comment above were the only mechanism, I’d expect us to be much worse. Another possible reply is that “non-real thoughts” don’t reliably lead to rewards from the steering subsystem, so the thought assessors down-weight the valence associated with these thoughts, leading to them being generated with lower frequency; consequently, the only thought sequences which remain are ones which terminate in “real thoughts” and stimulate accurate predictions of the steering subsystem. This seems plausibly sufficient, but it still doesn’t answer the question of why people don’t arbitrarily switch into “equally real but non-topical” thought sequences at higher frequencies.
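To make that second reply concrete, here is a minimal toy sketch in Python (the thoughts, numbers, and update rule are all invented by me for illustration; this is not a claim about the actual model in the post): a generator samples next thoughts in proportion to their learned valence, and thoughts whose episodes never terminate in steering-subsystem reward get their valence down-weighted over time.

```python
import random

# Toy model: the thought generator samples a thought in proportion to its
# current (learned) valence estimate. "Real" thoughts sometimes terminate in
# steering-subsystem reward; "non-real" thoughts never do, so their valence
# estimates decay across episodes.
THOUGHTS = {
    "plan to buy cake": {"valence": 1.0, "real": True},
    "eat cake":         {"valence": 1.0, "real": True},
    "imagine blanket":  {"valence": 1.0, "real": False},
    "idle daydream":    {"valence": 1.0, "real": False},
}

LEARNING_RATE = 0.2

def sample_thought():
    names = list(THOUGHTS)
    weights = [THOUGHTS[n]["valence"] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

def run_episode():
    thought = sample_thought()
    # The steering subsystem pays out only for "real" thoughts, and only
    # some of the time (here: with probability 0.5).
    reward = 1.0 if THOUGHTS[thought]["real"] and random.random() < 0.5 else 0.0
    # The thought assessors nudge the valence estimate toward the observed reward.
    v = THOUGHTS[thought]["valence"]
    THOUGHTS[thought]["valence"] = max(0.05, v + LEARNING_RATE * (reward - v))

if __name__ == "__main__":
    for _ in range(2000):
        run_episode()
    for name, t in THOUGHTS.items():
        print(f"{name:20s} valence ≈ {t['valence']:.2f}")
```

In this toy setup the non-real thoughts end up being sampled far less often, which is why the reply seems plausibly sufficient to me; what it doesn’t capture is the remaining question above, i.e. why equally real but non-topical sequences don’t intrude more often.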
FWIW, links to the references point back to localhost
Sure, but I think that misses the point I was trying to convey. If we end up in a world similar to the ones forecasted in AI 2027, the fraction of compute which labs allocate towards speeding up their own research threads will be larger than the fraction which they sell for public consumption.
My view is that even in worlds with significant speed-ups in R&D, we still ultimately care about the relative speed of progress on scalable alignment (in the Christiano sense) compared to capabilities & prosaic safety; it doesn’t matter if we finish quicker if catastrophic AI is finished quickest. Thus, an effective ToC for speeding up long-horizon research would still route through convincing lab leadership of the pursuitworthiness of research streams.
Labs do have a moat around compute. In the worlds where automated R&D gets unlocked, I would expect compute allocation to pivot substantially, making non-industrial automated research efforts non-competitive.
As far as I am concerned, AGI should be able to do any intellectual task that a human can do. I think that inventing important new ideas tends to take at least a month, but possibly as long as a PhD thesis. So it seems to be a reasonable interpretation that we might see human-level AI around the mid-2030s to 2040, which happens to be about my personal median.
There is an argument to be made that at larger scales of task length, cognitive tasks become cleanly factored; in other words, it’s more accurate to model completing something like a PhD as different instantiations of yourself coordinating across time over low-bandwidth channels, as opposed to you doing very high-dimensional inference for a very long time. If that’s the case, then one would expect AI to roughly match human performance on indefinite-horizon tasks once that scale has been reached.
I don’t think I fully buy this, but I don’t outright reject it.
I believe intelligence is pretty sophisticated while others seem to think it’s mostly brute force. This tangent would however require a longer discussion on the proper interpretation of Sutton’s bitter lesson.
I’d be interested in seeing this point fleshed out, as it’s a personal crux of mine (and I expect many others). The bullish argument which I’m compelled by goes something along the lines of:
1. Bitter Lesson: SGD is a much better scalable optimizer than you, and we’re bringing it to pretty stupendous scales.
2. Lots of Free Energy in Research Engineering: My model of R&D in frontier AI is that it is often blocked by a lot of tedious and laborious engineering. It doesn’t take a stroke of genius to think of RL on CoT; it took (comparatively) quite a while to get it to work.
3. Low Threshold in Iterating Engineering Paradigms: Take a technology, scale it, find its limits, pivot, repeat. There were many legitimate arguments floating around last year about the parallelism tradeoff and shortcut generalization which seemed to suggest limits to scaling pretraining. I take these to be basically correct; it just wasn’t that hard to pivot towards a nearby paradigm which didn’t face similar limits. I expect similar arguments to crop up around the limits of model-free RL, or OOD generalization of training on verifiable domains, or training on lossy representations of the real world (language), or inference on fixed-weight recurrence, or… I expect (many) of them to be basically correct; I just don’t expect the pivot towards a scalable solution to these to be that hard. Or in other words, I expect much of the effort involved in unlocking these new engineering paradigms to be made up of engineering hours, which we expect to be largely automated.
Interesting. Curious to know what your construction ended up looking like and I’m looking forward to reading the resulting proof!
I see it now
so here you go, I made this for you
I don’t see a flow chart
Strong upvote. Very clearly written and communicated. I’ve been recently thinking about digging deeper into this paper with the hope of potentially relating it to some recent causality-based interpretability work, and reading this distillation has accelerated my understanding of the paper. Looking forward to the rest of the sequence!
Phi-4 is highly capable not despite but because of synthetic data.
Imitation models tend to be quite brittle outside of their narrowly imitated domain, and I suspect the same will be the case for phi-4. Some of the decontamination measures they took provide some counter-evidence to this, but not much. I’d update more strongly if I saw results on benchmarks which contain the generality and diversity of tasks required to do meaningful autonomous cognitive labour “in the wild”, such as SWE-Bench (or rather what I understand SWE-Bench to be; I have yet to play very closely with it).
Phi-4 is taught by GPT-4; GPT-5 is being taught by o1; GPT-6 will teach itself.
There’s an important distinction between utilizing synthetic data in teacher-student setups and utilizing synthetic data in self-teaching. While synthetic data is a demonstrably powerful way of augmenting human feedback, my current estimation is that the typical mode-collapse arguments still hold for purely self-generated synthetic datasets, and that phi-4 doesn’t provide counter-evidence to this.
I’m curious how these claims relate to what’s proposed by this paper. (Note: I haven’t read either in depth.)
I’m curious what your read of the history is, here? My impression is that most important paradigm-forming work so far has involved empirical feedback somehow, but often in ways exceedingly dissimilar from/illegible to prevailing scientific and engineering practice.
I have a hard time imagining scientists like e.g. Darwin, Carnot, or Shannon describing their work as depending much on “immediate feedback loops with present day” systems.
Thanks for the comment @Adam Scholl, and apologies for not addressing it sooner; it was on my list, but then time flew. I think we’re in qualitative agreement that non-paradigmatic research tends to have empirical feedback loops, and that the forms and methods of empirical engagement undergo qualitative changes in the formation of paradigms. I suspect we may have quantitative disagreements about how illegible these methods were to previous practitioners, but I don’t expect that to be super cruxy.
The position which I would argue against is that the issue of empirical access to ASI necessitates long bouts of philosophical thinking prior to empirical engagement and theorization. The position which I would argue for is that there is significant (and, depending on the crowd, undervalued) benefit to be gained for conceptual innovation by having research communities which value quick, empirical feedback loops. I’m not an expert on either of these historical periods, but I would be surprised to hear that Carnot or Shannon did not meaningfully benefit from engaging with the practical industrial advancements of their day.
Giving my full models is out of scope for a comment and would take a sequence which I’ll probably never write, but the three history and philosophy of science references which have had the greatest impact on my thinking around empiricism, and which I tend to point people towards, would probably be Inventing Temperature, Exploratory Experiments, and Representing and Intervening.

So I’m curious whether you think PIBBSS would admit researchers like these into your program, were they around and pursuing similar strategies today?
In short I would say yes, because I don’t believe the criteria listed above exclude the researchers you called attention to. But independently of whether you buy that claim, I would stress that different programs have different mechanisms of admission. The affiliateship, as it’s currently being run, is designed for lower variance and is incidentally more tightly correlated with the research tastes of myself and the horizon scanning team, given that these are the folks providing the support for it. The summer fellowship is designed for higher variance and goes through a longer admission process involving a selection committee, with the final decisions falling on mentors.
Why are you sure that effective “evals” can exist even in principle?
Relatedly, the point which is least clear to me is what exactly it would mean to solve the “proper elicitation problem”, and what exactly the “requirements” laid out by the blue line on the graph are. I think I’d need to get clear on this problem scope before beginning to assess whether this elicitation gap can even in principle be crossed via the methods being proposed (i.e. better design & coverage of black-box evaluations).
As a non-example, possessing the kind of foundational scientific understanding which would allow someone to confidently say “We have run this evaluation suite and we now know once and for all that this system is definitely not capable of x, regardless of whatever elicitation techniques are developed in the future” seems to me to be Science-of-AI-complete, and is thus a non-starter as a north star for an agenda aimed at developing stronger inability arguments.
When I fast-forward the development of black-box evals aimed at supporting inability arguments, I see us arriving at a place where we have:
1. More SWE-Bench-esque evaluations across critical domains, which are perhaps “more Verified” by having higher-quality expert judgement passed upon them.
2. Some kind of library which brings together a family of different SOTA prompting and finetuning recipes to apply to any evaluation scenario.
3. More data points and stronger forecasts for post-training enhancements (PTEs).
This would allow us to make the claim “Given these trends in PTEs, and this coverage in evaluations, experts have vibed out that the probability of this model being capable of producing catastrophe x is under an acceptable threshold” for a wider range of domains. To be clear, that’s a better place than we are in now and something worth striving for, but not something I would qualify as “having solved the elicitation problem”. There are fundamental limitations to the kinds of claims which black-box evaluations can reasonably support, and if we are to posit that the “elicitation gap” is solvable, the claim needs to have the right sorts of qualifications, amendments, and hedging such that it lands on the right side of this fundamental divide.
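To make the shape of that claim concrete, here is a deliberately crude sketch (the benchmark scores, the linear trend, and the threshold are all invented for illustration; a real version would at minimum need uncertainty estimates rather than a point forecast): extrapolate the elicitation uplift observed across successive PTE generations and check whether the forecast stays under an agreed capability threshold.

```python
import numpy as np

# Invented data: best elicited score on a hypothetical dangerous-capability
# benchmark after each successive post-training enhancement (PTE) generation.
pte_generation = np.array([0, 1, 2, 3, 4])
best_elicited_score = np.array([0.12, 0.18, 0.22, 0.25, 0.27])

# Fit a simple linear trend to the observed elicitation uplift.
slope, intercept = np.polyfit(pte_generation, best_elicited_score, deg=1)

# Forecast the uplift a few PTE generations ahead and compare against an
# (invented) acceptable threshold for catastrophe-relevant capability.
DANGER_THRESHOLD = 0.60
for gen in range(5, 9):
    forecast = slope * gen + intercept
    verdict = "under threshold" if forecast < DANGER_THRESHOLD else "OVER threshold"
    print(f"PTE generation {gen}: forecast score {forecast:.2f} ({verdict})")
```

The point of the sketch is that the resulting claim is an extrapolation plus expert judgement, not a guarantee; the gap between this kind of exercise and a claim that can bear the weight of a safety case is roughly what I have in mind by the fundamental limitations above.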
Note: I don’t work on evals and expect that others have better models than this. My guess is that @Marius Hobbhahn has strong hopes for the field developing more formal statistical guarantees and other meta-evaluative practices, as outlined in the references in the science of evals post, and would thus predict a stronger safety case sketch than the one laid out in the previous paragraph; but what the type signature of that sketch would be, and consequently how reasonable it is given the fundamental limitations of black-box evaluations, is currently unclear to me.
As someone who has tried to engage with Live Theory multiple times and has found it intriguing, suspicious, and frustratingly mysterious, I was very happy to see this write-up. I was sad that the Q&A was announced on short notice. I’m looking forward to watching the recording, and am registering interest in attending a future one.