Coefficient Giving is one of the worst name changes I’ve ever heard:
Coefficient Giving sounds bad while OpenPhil sounded cool and snappy.
Coefficient doesn’t really mean anything in this context, clearly it’s a pun on “co” and “efficient” but that is also confusing. They say “A coefficient multiplies the value of whatever it’s paired with” but that’s just true of any number?
They’re a grantmaker that doesn’t really advise normal individuals about where to give their money, so why “Giving”, when their main thing is soliciting large philanthropic efforts and then auditing them?
Coefficient Giving doesn’t tell you what the company does at the start! “Good Ventures” and “GiveWell” tell you roughly what the company is doing.
“Coefficient” is a really weird word, so you’re burning weirdness points with the literal first thing anyone will ever hear you say. This seems like a name you would only think is good if you’re already deep into rat/ea spaces.
It sounds bad. Open Philanthropy rolls off the tongue, as does OpenPhil. OH-puhn fi-LAN-thruh-pee. Two sets of three. CO-uh-fish-unt GI-ving is an awkward four-two with a half-emphasis on the fish of coefficient. Sounds bad. I’m coming back to this point but there is no possible shortening other than “Coefficient” which is bad because it’s just an abstract noun and not very identifiable, whereas “OpenPhil” was a unique identifier. CoGive maybe, but then you have two stressed syllables which is awkward. It mildly offends my tongue to have to even utter their name.
Clearly OP wanted to shed their existing reputation, but man, this is a really bad name choice.
Apparently the thing of “people mixed up OpenPhil with other orgs” (in particular OpenAI’s non-profit and the Open Society Foundations) was a significantly bigger problem than I’d have thought — recurringly happening even in pretty high-stakes situations. (Like important grant applicants being confused.) And most of these misunderstandings wouldn’t even have been visible to employees.
And arguably this was just about to get even worse with the newly launched “OpenAI foundation” sounding even more similar.
Coefficient Giving doesn’t tell you what we do, but neither did Open Philanthropy. And fwiw, neither does Good Ventures, IMO—or many nonprofits, e.g. Lightcone Infrastructure, Redwood Research, the Red Cross. Agreed that GiveWell is a very descriptive and good name though.
Setting aside whether “coefficient” is a weird word, I don’t think having an unusual word in your name “burns weirdness points” in a costly way. Take some of the world’s biggest companies—Nvidia (“invidious”), Google (“googol”), Meta—these aren’t especially common words, but it seems to have worked out for them.
The emphasis of “coefficient” is on the “fish.” So it’s not an “awkward four-two,” it’s three sets of two, which seems melodic enough (cf. 80,000 Hours, Lightcone Infrastructure, European Union, Make-A-Wish Foundation, etc).
On the no possible shortening, again, time will tell, but my money is on “CG,” which seems fine.
On a different level, “philanthropy” is less weird in the name of a philanthropy org. It’s also doing work. If someone has to look up what “philanthropy” means, then they become less confused. If they do that for “coefficient” then they’re just even more confused. It’s also the case that basically anyone can understand what “philanthropy” means given a one-sentence description, which isn’t as easily the case for “coefficient” (I actually don’t know a good definition for “coefficient” off the top of my head, despite the fact that I can name several coefficients).
As a non-native English speaker, OpenPhil was sooo much easier to pronounce than Coefficient Giving. I’m sure this shouldn’t play a big part in the naming decision, but still...
I think this is a case of a curb cut effect. If it’s easy (vs hard) to pronounce for non-native speakers, it’s also easy (vs hard) to get the point across at a noisy party, or over a crackly phone line, or if someone’s distracted.
At least their website now explains what they do. It took me a very long time to understand what OpenPhil does and where they get the money from. Now it’s all finally explained on the front page.
Oh yeah, and as my partner pointed out to me today: while “coefficients multiply whatever they are next to”, lots of things called “coefficients” that we commonly encounter have values smaller than 1 (e.g. the coefficient of friction, the drag coefficient, and the coefficient of restitution all commonly have values < 1).
From what I understand, the issue was mostly with the “Open” part (because of mis-association with OpenAI, and also because OP is no longer “open” in the sense that they decided not to disclose some of their donations (for whatever reasons)), but then they could just go for something like, idk, MaxiValPhil, which is less pleasant to pronounce and less pleasant-sounding than OpenPhil, but:
it doesn’t sound as bad as Coefficient Giving
is easier to pronounce
doesn’t spend weirdness points by using an uncommon word “coefficient”
communicates what it’s about (even people who are not used to thinking in terms of expected value will mostly correctly guess what “Maximum Value Philanthropy” wants to do)
(As a very minor thing, algebraists / category theorists will be making jokes that they’re the opposite of efficient.)
“Coefficient Giving sounds bad while OpenPhil sounded cool and snappy.”—OpenPhil just sounds better because it’s shorter. I imagine that instead of saying the full name, Coefficient Giving will soon acquire some similar sort of nickname—probably people will just say “Coefficient”, which sounds kinda cool IMO. I could also picture people writing “Coeff” as shorthand, although it would be weird to try and say “Coeff” out loud.
I could also get used to saying “cGive”, pronounced similarly to “sea give”, which has nice connotations spoken aloud and has the c for “coefficient” right there in the written version. But I agree that “Coefficient” has a good sound, compared to which “cGive” seems more generic.
I like Gradient Giving. “A gradient is the direction of fastest increase” would have also been a good explanation and literally reflects what they’re trying to do. It rhymes with “radiant,” which sounds optimistic.
I feel like they’re trying to brand more in the direction of being advisors for others’ money, but aren’t willing to go all the way? I’m not sure why, I guess they want to keep more relative power in the relationships they build.
We used Monte Carlo simulations to estimate, for various sentience models and across eighteen organisms, the distribution of plausible probabilities of sentience.
We used a similar simulation procedure to estimate the distribution of welfare ranges for eleven of these eighteen organisms, taking into account uncertainty in model choice, the presence of proxies relevant to welfare capacity, and the organisms’ probabilities of sentience (equating this probability with the probability of moral patienthood).
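For concreteness, here is a minimal sketch of what a procedure of this general shape might look like. The model names, credences, and ranges below are made-up placeholders for illustration, not RP’s actual inputs:

```python
import random

# Hypothetical sketch of a Monte Carlo welfare-range estimate, loosely following
# the procedure quoted above. Model names, credences, and ranges are invented
# for illustration and are NOT Rethink Priorities' actual models or numbers.

MODELS = {                      # credence assigned to each welfare-capacity model
    "neuron_count": 0.3,
    "cognitive_proxies": 0.4,
    "equal_ranges": 0.3,
}

def sample_welfare_range(p_sentience_draws, model_ranges, n=100_000):
    """Return Monte Carlo samples of an organism's welfare range (human = 1)."""
    samples = []
    for _ in range(n):
        # 1. Sample a sentience probability from its (pre-computed) distribution
        p_sent = random.choice(p_sentience_draws)
        # 2. Sample which welfare-capacity model is correct, by credence
        model = random.choices(list(MODELS), weights=MODELS.values())[0]
        # 3. Sample a welfare range conditional on that model
        lo, hi = model_ranges[model]
        conditional_range = random.uniform(lo, hi)
        # 4. Zero welfare range if the organism turns out not to be a moral patient
        is_patient = random.random() < p_sent
        samples.append(conditional_range if is_patient else 0.0)
    return samples

# Example with made-up numbers for a hypothetical invertebrate
shrimp_samples = sample_welfare_range(
    p_sentience_draws=[0.2, 0.3, 0.4, 0.5],
    model_ranges={"neuron_count": (0.0001, 0.001),
                  "cognitive_proxies": (0.01, 0.5),
                  "equal_ranges": (1.0, 1.0)},
)
print(sum(shrimp_samples) / len(shrimp_samples))  # mean welfare range vs. humans
```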
Now with the disclaimer that I do think that RP are doing good and important work and are one of the few organizations seriously thinking about animal welfare priorities...
Their epistemics led them to run a Monte Carlo simulation to determine whether organisms are capable of suffering (and if so, how much), get a value of 5 shrimp = 1 human, and then not bat an eye at this number.
Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons. The authors get around this by refusing to stake themselves on any theory of consciousness.
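Spelling out the arithmetic behind that claim, using the post’s own figures (a welfare ratio of 1/5 per shrimp and a neuron count lower by a factor of $10^5$):

$$\frac{W_{\text{shrimp}}/N_{\text{shrimp}}}{W_{\text{human}}/N_{\text{human}}} = \frac{W_{\text{shrimp}}}{W_{\text{human}}}\cdot\frac{N_{\text{human}}}{N_{\text{shrimp}}} = \frac{1}{5}\times 10^{5} = 2\times 10^{4},$$

i.e. roughly four orders of magnitude more suffering per neuron per unit time.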
The overall structure of the RP welfare range report does not cut to the truth; instead, the core mental motion seems to be to engage with as many existing pieces of work as possible. Credence is doled out to different schools of thought and pieces of evidence in a way which seems more like appeasement, lip-service, or a “well, these guys have done some work, who are we to disrespect them by ignoring it” attitude. Removal of noise is one of the most important functions of meta-analysis, and it is largely absent.
The result of this is an epistemology where the accuracy of a piece of work is a monotonically increasing function of the number of sources, theories, and lines of argument. Which is fine if your desired output is a very long Google doc, and a disclaimer to yourself (and, more cynically, your funders) that “No no, we did everything right, we reviewed all the evidence and took it all into account,” but it’s pretty bad if you want to actually be correct.
I grow increasingly convinced that the epistemics of EA are not especially good, worsening, and already insufficient to work on the relatively low-stakes and easy issue of animal welfare (as compared to AI x-risk).
Their epistemics led them to run a Monte Carlo simulation to determine whether organisms are capable of suffering (and if so, how much), get a value of 5 shrimp = 1 human, and then not bat an eye at this number.
Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons.
epistemic status: Disagreeing on object-level topic, not the topic of EA epistemics.
I disagree; functionalism especially can justify a number like this. Here’s an example of reasoning along these lines:
Suffering is the structure of some computation, and different levels of suffering correspond to different variants of that computation.
What matters is whether that computation is happening.
The structure of suffering is simple enough to be represented in the neurons of a shrimp.
Under that view, shrimp can absolutely suffer in the same range as humans, and the amount of suffering depends on crossing some threshold number of neurons. One might argue that higher levels of suffering require computations with higher complexity, but intuitively I don’t buy this—more/purer suffering appears less complicated to me, on introspection (just as higher/purer pleasure appears less complicated as well).
I think I put a bunch of probability mass on a view like above.
(One might argue that it’s about the number of times the suffering computation is executed, not whether it’s present or not, but I find that view intuitively less plausible.)
You didn’t link the report and I’m not able to make it out from all of the Rethink Priorities moral weight research, so I can’t agree/disagree on the state of EA epistemics shown in there.
As to your point: this is one of the better arguments I’ve heard that welfare ranges might be similar between animals. Still, I don’t think it squares well with the actual nature of the brain. Saying there’s a single suffering computation would make sense if the brain were like a CPU, where one core did the thinking, but actually all of the neurons in the brain are firing at once and doing computations at the same time. So it makes much more sense to me to think that the more neurons are computing some sort of suffering, the greater the intensity of suffering.
One intuition against this is by drawing an analogy to LLMs: the residual stream represents many features. All neurons participate in the representation of a feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that the larger model represents features with greater magnitude.
In humans it seems to be the case that consciousness is most strongly connected to processes in the brain stem, rather than the neocortex. Here is a great talk about the topic—the main points are (writing from memory, might not be entirely accurate):
humans can lose consciousness or produce intense emotions (good and bad) through interventions on a very small area of the brain stem. When other, much larger parts of the brain are damaged or missing, humans continue to behave in a way such that one would ascribe emotions to them from interaction; for example, they show affection.
dopamine, serotonin, and other chemicals that alter consciousness work in the brain stem
If we consider the question from an evolutionary angle, I’d also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity with which I feel emotions is strongly correlated with how action-guiding they are, and I think as a child I felt emotions more intensely than I do now, which also fits the hypothesis that more ability to think abstractly reduces the intensity of emotions.
I agree with you that the “structure of suffering” is likely to be represented in the neurons of shrimp. I think it’s clear that shrimps may “suffer” in the sense that they react to pain, move away from sources of pain, would prefer to be in a painless state rather than a painful state, etc.
But where I diverge from the conclusions drawn by Rethink Priorities is that I believe shrimp are less “conscious” (for lack of a better word) than humans, and so their suffering matters less. Though shrimp show outward signs of pain, I sincerely doubt that with just 100,000 neurons there’s much of a subjective experience going on there. This is purely intuitive, and I’m not sure of the specific neuroscience of shrimp brains or Rethink Priorities’ arguments against this. But it seems to me that the “level of consciousness” animals have sits on an axis that’s roughly correlated with neuron count, with humans and elephants at the top and C. elegans at the bottom.
Another analogy I’ll throw out is that humans can react to pain unconsciously. If you put your hand on a hot stove, you will reflexively pull your hand away before the feeling of pain enters your conscious perception. I’d guess shrimp pain response works in a similar way: largely unconscious processing, due to their very low neuron count.
“In regards to intelligence, we can question both the extent to which more neurons are correlated with intelligence and whether more intelligence in fact predicts greater moral weight;
Many ways of arguing that more neurons results in more valenced consciousness seem incompatible with our current understanding of how the brain is likely to work; and
There is no straightforward empirical evidence or compelling conceptual arguments indicating that relative differences in neuron counts within or between species reliably predicts welfare relevant functional capacities.
Overall, we suggest that neuron counts should not be used as a sole proxy for moral weight, but cannot be dismissed entirely”
This hardly seems an argument against the one in the shortform, namely
Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons. The authors get around this by refusing to stake themselves on any theory of consciousness.
If the original authors never thought of this that seems on them.
Are there any high p(doom) orgs who are focused on the following:
Pick an alignment “plan” from a frontier lab (or org like AISI)
Demonstrate how the plan breaks or doesn’t work
Present this clearly and legibly for policymakers
Seems like this is a good way for people to deploy technical talent in a way which is tractable. There are a lot of people who are smart but not alignment-solving levels of smart who are currently not really able to help.
AI companies don’t actually have specific plans, they mostly just hope that they’ll be able to iterate. (See Sam Bowman’s bumper post for an articulation of a plan like this.) I think this is a reasonable approach in principle: this is how progress happens in a lot of fields. For example, the AI companies don’t have plans for all kinds of problems that will arise with their capabilities research in the next few years, they just hope to figure it out as they get there. But the lack of specific proposals makes it harder to demonstrate particular flaws.
A lot of my concerns about alignment proposals are that when AIs are sufficiently smart, the plan won’t work anymore. But in many cases, the plan does actually work fine right now at ensuring particular alignment properties. (Most obviously, right now, AIs are so bad at reasoning about training processes that scheming isn’t that much of an active concern.) So you can’t directly demonstrate that current plans will fail later without making analogies to future systems; and observers (reasonably enough) are less persuaded by evidence that requires you to assess the analogousness of a setup.
(Psst: a lot of AISI’s work is this, and they have sufficient independence and expertise credentials to be quite credible; this doesn’t go for all of their work, some of which is indeed ‘try for a better plan’)
(There are projects that stress-test the assumptions behind AGI labs’ plans, of course, but I don’t think anyone is (1) deliberately picking at the plans AGI labs claim, in a basically adversarial manner, (2) optimizing experimental setups and results for legibility to policymakers, rather than for convincingness to other AI researchers. Explicitly setting those priorities might be useful.)
optimizing experimental setups and results for legibility to policymakers, rather than for convincingness to other AI researchers.
People who do research like this are definitely optimizing for legibility to policymakers (always at least a bit, and usually a lot).
One problem is that if AI researchers think your work is misleading/scientifically suspect, they get annoyed at you and tell people that your research sucks and you’re a dishonest ideologue. This is IMO often a healthy immune response, though it’s a bummer when you think that the researchers are wrong and your work is fine. So I think it’s pretty costly to give up on convincingness to AI researchers.
“Not optimized to be convincing to AI researchers” ≠ “looks like fraud”. “Optimized to be convincing to policymakers” might involve research that clearly demonstrates some property of AIs/ML models which is basic knowledge for capability researchers (and for which they already came up with rationalizations why it’s totally fine) but isn’t well-known outside specialist circles.
E. g., the basic example is the fact that ML models are black boxes trained by an autonomous process which we don’t understand, instead of manually coded symbolic programs. This isn’t as well-known outside ML communities as one might think, and non-specialists are frequently shocked when they properly understand that fact.
What kind of “research” would demonstrate that ML models are not the same as manually coded programs? Why not just link to the Wikipedia article for “machine learning”?
My impression is that the current Real Actual Alignment Plan For Real This Time amongst medium p(Doom) people looks something like this:
Advance AI control, evals, and monitoring as much as possible now
Try and catch an AI doing a maximally-incriminating thing at roughly human level
This causes [something something better governance to buy time]
Use the almost-world-ending AI to “automate alignment research”
(Ignoring the possibility of a pivotal act to shut down AI research. Most people I talk to don’t think this is reasonable.)
I’ll ignore the practicality of 3. What do people expect 4 to look like? What does an AI assisted value alignment solution look like?
My rough guess of what it could be, i.e. the highest p(solution is this|AI gives us a real alignment solution) is something like the following. This tries to straddle the line between the helper AI being obviously powerful enough to kill us and obviously too dumb to solve alignment:
Formalize the concept of “empowerment of an agent” as a property of causal networks with the help of theorem-proving AI.
Modify today’s autoregressive reasoning models into something more isomorphic to a symbolic causal network. Use some sort of minimal circuit system (mixture of depths?) and prove isomorphism between the reasoning traces and the external environment.
Identify “humans” in the symbolic world-model, using automated mech interp.
Target a search of outcomes towards the empowerment of humans, as defined in 1.
Is this what people are hoping plops out of an automated alignment researcher? I sometimes get the impression that most people have no idea whatsoever how the plan works, which means they’re imagining the alignment-AI to be essentially magic. The problem with this is that magic-level AI is definitely powerful enough to just kill everyone.
Is your question more about “what’s the actual structure of the ‘solve alignment’ part”, or “how are you supposed to use powerful AIs to help with this?”
I think there’s one structure-of-plan that is sort of like your outline (I think it is similar to John Wentworth’s plan but sort of skipping ahead past some steps and being more-specific-about-the-final-solution which means more wrong)
(I don’t think John self-identifies as particularly oriented around your “4 steps from AI control to automate alignment research”. I haven’t heard the people who say ‘let’s automate alignment research’ say anything that sounded very coherent. I think many people are thinking something like “what if we had a LOT of interpretability?” but IMO not really thinking through the next steps needed for that interpretability to be useful in the endgame.)
STEM AI → Pivotal Act
I haven’t heard anyone talk about this for awhile, but a few years back I heard a cluster of plans that were something like “build STEM AI with very narrow ability to think, which you could be confident couldn’t model humans at all, which would only think about resources inside a 10′ by 10′ cube, and then use that to invent the pre-requisites for uploading or biological intelligence enhancement, and then ??? → very smart humans running at fast speeds figure out how to invent a pivotal technology.”
I don’t think the LLM-centric era lends itself well to this plan. But, I could see a route where you get a less-robust-and-thus-necessarily-weaker STEM AI trained on a careful STEM corpus with careful control and asking it carefully scoped questions, which could maybe be more powerful than you could get away with for more generically competent LLMs.
Yes, a human-uploading or human-enhancing pivotal act might actually be something people are thinking about. Yudkowsky gives his nanotech-GPU-melting pivotal act example, which—while he has stipulated that it’s not his real plan—still anchors “pivotal act” on “build the most advanced weapon system of all time and carry out a first-strike”. This is not something that governments (and especially companies) can or should really talk about as a plan, since threatening a first-strike on your geopolitical opponents does not a cooperative atmosphere make.
(though I suppose a series of targeted, conventional strikes on data centers and chip factories across the world might be on the pareto-frontier of “good” vs “likely” outcomes)
My question was an attempt to trigger a specific mental motion in a certain kind of individual. Specifically, I was hoping for someone who endorses that overall plan to envisage how it would work end-to-end, using their inner sim.
My example was basically what I get when I query my inner sim, conditional on that plan going well.
Thoughts on efforts to shift public (or elite, or political) opinion on AI doom.
Currently, it seems like we’re in a state of being Too Early. AI is not yet scary enough to overcome peoples’ biases against AI doom being real. The arguments are too abstract and the conclusions too unpleasant.
Currently, it seems like we’re in a state of being Too Late. The incumbent players are already massively powerful and capable of driving opinion through power, politics, and money. Their products are already too useful and ubiquitous to be hated.
Unfortunately, these can both be true at the same time! This means that there will be no “good” time to play our cards. Superintelligence (2014) was Too Early but not Too Late. There may be opportunities which are Too Late but not Too Early, but (tautologically) these have not yet arrived. As it is, current efforts must fight on both fronts.
I like this framing; we’re both too early and too late. But it might transition quite rapidly from too early to right on time.
One idea is to prepare strategies and arguments and perhaps prepare the soil of public discourse in preparation for the time when it is no longer too early. Job loss and actually harmful AI shenanigans are very likely before takeover-capable AGI. Preparing for the likely AI scares and negative press might help public opinion shift very rapidly as it sometimes does (e.g., COVID opinions went from no concern to shutting down half the economy very quickly).
The average American and probably the average global citizen already dislikes AI. It’s just the people benefitting from it that currently like it, and that’s a minority.
Whether that’s enough is questionable, but it makes sense to try and hope that the likely backlash is at least useful in slowing progress or proliferation somewhat.
So Sonnet 3.6 can almost certainly speed up some quite obscure areas of biotech research. Over the past hour I’ve got it to:
Estimate a rate, correct itself (although I did have to clock that its result was likely off by some OOMs, which turned out to be 7-8), request the right info, and then get a more reasonable answer.
Come up with a better approach to a particular thing than I was able to, which I suspect has a meaningfully higher chance of working than what I was going to come up with.
Perhaps more importantly, it required almost no mental effort on my part to do this. Barely more than scrolling twitter or watching youtube videos. Actually solving the problems would have had to wait until tomorrow.
I will update in 3 months as to whether Sonnet’s idea actually worked.
(in case anyone was wondering, it’s not anything relating to protein design lol: Sonnet came up with a high-level strategy for approaching the problem)
The latest recruitment ad from Aiden McLaughlin tells a lot about OpenAI’s internal views on model training:
My interpretation of OpenAI’s worldview, as implied by this, is:
Inner alignment is not really an issue. Training objectives (evals) relate to behaviour in a straightforward and predictable way.
Outer alignment kinda matters, but it’s not that hard. Deciding the parameters of desired behaviour is something that can be done without serious philosophical difficulties.
Designing the right evals is hard, you need lots of technical skill and high taste to make good enough evals to get the right behaviour.
Oversight is important, in fact oversight is the primary method for ensuring that the AIs are doing what we want. Oversight is tractable and doable.
None of this dramatically conflicts with what I already thought OpenAI believed, but it’s interesting to get another angle on it.
It’s quite possible that 1 is predicated on technical alignment work being done in other parts of the company (though their superalignment team no longer exists) and it’s just not seen as the purview of the evals team. If so it’s still very optimistic. If there isn’t such a team then it’s suicidally optimistic.
For point 2, I think the above ad does imply that the evals/RL team is handling all of the questions of “how should a model behave”, and that they’re mostly not looking at it from the perspective of moral philosophy a la Amanda Askell at Anthropic. If questions of how models behave are entirely being handled by people selected only for artistic talent + autistic talent, then I’m concerned these won’t be done well either.
3 seems correct in that well-designed evals are hard to make and you need skills beyond technical skills. Nice, but it’s telling that they’re doing well on the issue which brings in immediate revenue, and badly on the issues that get us killed at some point in the future.
Point 4 is kinda contentious. Some very smart people take oversight very seriously, but it also seems kinda doomed as an agenda when it comes to ASI. Seems like OpenAI are thinking about at least one not-kill-everyoneism plan, but a marginally promising one at best. Still, if we somehow make miraculous progress on oversight, perhaps OpenAI will take up those plans.
Finally, I find the mention of “The Gravity of AGI” to be quite odd since I’ve never got the sense that Aiden feels the gravity of AGI particularly strongly. As an aside I think that “feeling the AGI” is like enlightenment, where everyone behind you on the journey is a naive fool and everyone ahead of you is a crazy doomsayer.
EDIT: a fifth implication: little to no capabilities generalization. Seems like they expect each individual capability to be produced by a specific high-quality eval, rather than for their models to generalize broadly to a wide range of tasks.
Re: 1, during my time at OpenAI I also strongly got the impression that inner alignment was way underinvested. The Alignment team’s agenda seemed basically about better values/behavior specification IMO, not making the model want those things on the inside (though this is now 7 months out of date). (Also, there are at least a few folks within OAI I’m sure know and care about these issues)
HPMOR presents a protagonist who has a brain which is 90% that of a merely very smart child, but which is 10% filled with cached thought patterns taken directly from a smarter, more experienced adult. Part of the internal tension of Harry is between the un-integrated Dark Side thoughts and the rest of his brain.
Ironic then, that the effect that reading HPMOR—and indeed a lot of Yudkowsky’s work—was to imprint a bunch of un-integrated alien thought patterns onto my existing merely very smart brain. A lot of my development over the past few years has just been trying to integrate these things properly with the rest of my mind.
Fair enough, done. This felt vaguely like tagging spoilers for Macbeth or the Bible, but then I remembered how annoyed I was to have Of Mice And Men spoiled for me at age fifteen.
Logical inductors consider belief-states as prices over logical sentences ϕ in some language, with the belief-states decided by different computable “traders”, and also some decision process which continually churns out proofs of logical statements in that language. This is a bit unsatisfying, since it contains several different kinds of things.
What if, instead of buying shares in logical sentences, the traders bought shares in each other? Then we only need one kind of thing.
Let’s make this a bit more precise:
Each trader is a computable program in some language (let’s just go with Turing machines for now, modulo some concern about the macros for actually making trades)
Each timestep, each trader is run for some amount of time (let’s just say one Turing machine step)
These programs can be well-ordered (already required for Logical Induction)
Each trader $p_i$ is assigned an initial amount of cash according to some relation $a e^{-b \times i}$
Each trader can buy and sell “shares” in any other trader (again, very similarly to logical induction)
If a trader halts, its current cash is distributed across its shareholders (otherwise that cash is lost forever)
(Probably some other points, such as each trader’s current valuation of itself: if no trader is willing to sell its own shares, how does that work? does each trader value its own (still remaining) shares proportional to its current cash stock? how are the shares distributed at the start?)
This system contains a pseudo-model of logical induction: for any formal language which can be modelled with Turing machines, and for any formal statement in that language, there exists a trader with some non-zero initial value which lists out all possible proofs in that language, halting if (and only if) it finds a proof of its particular proposition. This makes a couple of immediately obvious changes from Logical Induction:
There’s a larger “bounty” (i.e. higher cash prize) for proving simpler statements given by earlier Turing machines, since the earlier a particular prover Turing machine appears in our ordering, the larger its starting cash
We don’t have to separately define a particular speed/order in which proofs of logical statements become available, since it’s already part of the system
The market’s “belief” in that statement can be expressed as that particular Turing machine’s share price divided by its cash holdings
We can also intervene in the market by distributing cash to certain traders, or setting up a forcible halting rule on traders.
Does this actually do the logical induction thing? I have no clue.
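For concreteness, here’s a minimal sketch of the market bookkeeping described above. The price-setting and trading rules are deliberately left as stubs (the post doesn’t specify them), and all names are my own placeholders:

```python
import math
from dataclasses import dataclass, field

# A minimal sketch of the "traders buy shares in other traders" market described
# above. The trading rule each trader runs is left abstract; everything here is
# illustrative bookkeeping, not a worked-out mechanism.

@dataclass
class Trader:
    index: int                                    # position in the well-ordering
    step_fn: callable                             # runs one "machine step"; returns True when the trader halts
    cash: float = 0.0
    holdings: dict = field(default_factory=dict)  # trader index -> shares held in that trader
    halted: bool = False

def initial_cash(i, a=1.0, b=0.1):
    """Starting cash a * e^(-b*i): earlier traders in the ordering start richer."""
    return a * math.exp(-b * i)

def timestep(traders):
    """One market step: run each live trader once, then settle any halts."""
    for t in traders:
        if not t.halted:
            t.halted = t.step_fn(t, traders)      # trader may buy/sell shares here
    for t in traders:
        if t.halted and t.cash > 0:
            # A halting trader pays out its cash to its shareholders, pro rata.
            total_shares = sum(o.holdings.get(t.index, 0.0) for o in traders)
            if total_shares > 0:
                for o in traders:
                    o.cash += t.cash * o.holdings.get(t.index, 0.0) / total_shares
            t.cash = 0.0

# Toy usage: trader 0 never halts; trader 1 halts immediately, paying out to trader 0.
traders = [
    Trader(0, step_fn=lambda self, market: False, cash=initial_cash(0), holdings={1: 1.0}),
    Trader(1, step_fn=lambda self, market: True, cash=initial_cash(1)),
]
timestep(traders)
print(traders[0].cash)   # trader 0 collected trader 1's payout
```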
Neat idea, I’ve thought about similar directions in the context of traders betting on traders in decision markets
A complication might be that a regular deductive process doesn’t discount the “reward” of a proposition based on its complexity, whereas your model does, so it might have a different notion of the logical induction criterion. For instance, you could have an inductor that’s exploitable, but only for propositions with larger and larger complexities over time, such that with the complexity discounting the cash loss is still finite (but the regular LI loss would be infinite, so it wouldn’t satisfy the regular LI criterion)
(Note that betting on “earlier propositions” already seems beneficial in regular LI since if you can receive payouts earlier you can use it to place larger bets earlier)
There’s also some redundancy where each proposition can be encoded by many different Turing machines, whereas a deductive process can guarantee uniqueness in its ordering & be more efficient that way
Are prices still determined using Brouwer’s fixed point theorem? Or do you have a more auction-based mechanism in mind?
I’ve been a bit confused about “steering” as a concept. It seems kinda dual to learning, but why? It seems like things which are good at learning are very close to things which are good at steering, but they don’t always end up steering. It also seems like steering requires learning. What’s up here?
I think steering is basically learning, backwards, and maybe flipped sideways. In learning, you build up mutual information between yourself and the world; in steering, you spend that mutual information. You can have learning without steering—but not the other way around—because of the way time works.
This also lets us map certain things to one another: the effectiveness of methods like Monte Carlo tree search (i.e. calling your world model repeatedly to form a plan) can be seen as dual to the effectiveness of things like randomized controlled trials (i.e. querying the external world repeatedly to form a good model).
I think steering is basically learning, backwards, and maybe flipped sideways. In learning, you build up mutual information between yourself and the world; in steering, you spend that mutual information. You can have learning without steering—but not the other way around—because of the way time works.
Alternatively, for learning your brain can start out in any given configuration, and it will end up in the same (small set of) final configuration (one that reflects the world); for steering the world can start out in any given configuration, and it will end up in the same set of target configurations
It seems like some amount of steering without learning is possible (open-loop control), you can reduce entropy in a subsystem while increasing entropy elsewhere to maintain information conservation
I think this is probably an underestimate, because I think that the estimates of shrimp suffering during death are probably too high.
(While I’m very critical of all of RP’s welfare range estimates, including shrimp, that’s not my point here. This argument doesn’t rely on any arguments about shrimp welfare ranges overall. I do compare humans and shrimp, but IIUC this sort of comparison is the thing you multiply by the welfare range estimate to get your utility value, if you’re into that)
(If ammonia interventions are developed that are really 1-2 OOMs better than stunning, then even under my utility function it might be in the same ballpark as the cage-free egg campaigns and other animal charities)
Shrimp Freezing != Human Freezing
Shrimp stunning, as an intervention, attempts to “knock out” shrimp before they’re killed by being placed on ice, which has been likened to “suffocating in a suitcase in the Antarctic”.
I think that’s a bad metaphor. Humans are endotherms and homeotherms, which means we maintain a constant internal body temperature which we generate internally. If it drops, a bunch of stress responses are triggered: shivering, discomfort, etc. which attempt to raise our temperature. Shrimp are poikilotherms, meaning they don’t regulate body temperature much at all. This means they don’t have the same stress responses to cold that we do.
(I also doubt that, given they have no lungs and can basically never “not breathe” in their normal environment, they’d experience the same stress from asphyxiation that we do, but this is weaker)
I would guess that being thrown onto ice effectively “stuns” the shrimp pretty quickly, as their metabolism—and therefore their synaptic activity—drops with their body temperature.
As awful as the amount of fraud (and its lesser cousins) in science is for a scientist, it must be so much worse for a layperson. For example, this is a paper I found today suggesting that cleaner wrasse, a type of finger-sized fish, can not only pass the mirror test, but are able to remember their own face and later respond the same way to a photograph of themselves as to a mirror.
Ok, but it was published in PNAS. As a researcher I happen to know that PNAS allows for special-track submissions from members of the National Academy of Sciences (the NAS in PNAS) which are almost always accepted. The two main authors are Japanese, and have zero papers other than this, which is a bit suspicious in and of itself but it does mean that they’re not members of the NAS. But PNAS is generally quite hard to publish in, so how did some no-names do that?
Aha! I see that the paper was edited by Frans de Waal! Frans de Waal is a smart guy but he also generally leans in favour of animal sentience/abilities, and crucially he’s a member of the NAS so it seems entirely plausible that some Japanese researchers with very little knowledge managed to “massage” the data into a state where Frans de Waal was convinced by it.
Or not! There’s literally no way of knowing at this point, since “true” fraud (i.e. just making shit up) is basically undetectable, as is cherry-picking data!
This is all insanely conspiratorial of course, but this is the approach you have to take when there’s so much lying going on. If I was a layperson there’s basically no way I could have figured all this out, so the correct course of action would be to unboundedly distrust everything regardless.
So I still don’t know what’s going on, but this probably mischaracterizes the situation. The original notification that Frans de Waal “edited” the paper actually means that he was the individual who coordinated the reviews of the paper at the journal’s end, which was not made particularly clear. The lead authors do have other publications (mostly in the same field); it’s just that the particular website I was using didn’t show them. There’s also a strongly skeptical response to the paper that’s been written by … Frans de Waal, so I don’t know what’s going on there!
The thing about PNAS having a secret submission track is true as far as I know though.
The editor of an article is the person who decides whether to desk-reject or seek reviewers, find and coordinate the reviewers, communicate with the authors during the process and so on. That’s standard at all journals afaik. The editor decides on publication according to the journal’s criteria. PNAS does have this special track but one of the authors must be in NAS, and as that author you can’t just submit a bunch of papers in that track, you can use it once a year or something. And most readers of PNAS know this and are suitably sceptical of those papers (and it’s written on the paper if it used that track). The journal started out only accepting papers from NAS members and opened to everyone in the 90s so it’s partly a historical quirk.
Reading between the lines here, Opus 4 was RLed by repeated iteration and testing. Seems like they had to hit it fairly hard (for Anthropic) with the “identify specific bad behaviors and stop them” technique.
Relatedly: Opus 4 doesn’t seem to have the “good vibes” that Opus 3 had.
Furthermore, this (to me) indicates that Anthropic’s techniques for model “alignment” are getting less elegant and sophisticated over time, since the models are getting smarter—and thus harder to “align”—faster than Anthropic is getting better at it. This is a really bad trend, and is something that people at Anthropic should be noticing as a sign that Business As Usual does not lead to a good future.
There’s a court at my university accommodation that people who aren’t Fellows of the college aren’t allowed on, it’s a pretty medium-sized square of mown grass. One of my friends said she was “morally opposed” to this (on biodiversity grounds, if the space wasn’t being used for people it should be used for nature).
And I couldn’t help but think how tiring it would be to have a moral-feeling-detector this strong. How could one possibly cope with hearing about burglaries, or North Korea, or astronomical waste?
I’ve been aware of scope insensitivity for a long time now but, this just really put things in perspective in a visceral way for me.
For many who talk about “moral opposition”, talk is cheap, and the cause of such a statement may be in-group or virtue signaling rather than an indicator of intensity of moral-feeling-detector.
You haven’t really stated that she’s putting all that much energy into this (implied, I guess), but I’d see nothing wrong with having a moral stance about literally everything but still prioritizing your activity in healthy ways, judging this, maybe even arguing vociferously for it, for about 10 minutes, before getting back to work and never thinking about it again.
To me it seems more likely that this person is misreporting their motive than that they really oppose this allocation of a patch of grass on biodiversity grounds. I would expect grounds like “I want to use it myself” or slightly more general “it should be available for a wider group” to be very much more common, for example if I had to rank likelihood of motives after hearing that someone objects, but before hearing their reasons. I’d end up with more weight on “playing social games” than on “earnestly believes this”.
On the other hand it would not surprise me very much that at least one person somewhere might truly hold this position. Just my weight for any particular person would be very low.
Deep learning understood as a process of up- and down-weighting circuits is incredibly similar conceptually to logical induction.
Pre- and post-training LLMs is like juicing the market so that all the wealthy traders are different human personas, then giving extra liquidity to the ones we want.
I expect that the process of an agent cohering from a set of drives into a single thing is similar to the process of a predictor inferring the (simplicity-weighted) goals of an agent by observing it. RLVR is like rewarding traders which successfully predict what an agent which gets high reward would do.
Logical Induction doesn’t get you all the way, since the circuits can influence other circuits, like traders that are allowed to bet on each other, or something.
(These analogies aren’t quite perfect, I swapped between trading day-as-training batch and trading day-as-token)
Somebody must have made these observations before, but I’ve never seen them.
Seems like there’s two strands of empathy that humans can use.
The first kind is emotional empathy, where you put yourself in someone’s place and imagine what you would feel. This one usually leads to sympathy, giving material assistance, comforting.
The second kind is agentic empathy, where you put yourself in someone’s place and imagine what you would do. This one more often leads to giving advice.
A common kind of problem occurs when we deploy one type of empathy but not the other. John Wentworth has written about how (probably due to lack of emotional empathy) he finds himself much less kind when he puts himself in others’ shoes.
Splitting empathy is a common trick when discussing conflicts: for my friends, feelings; for my enemies, decisions. You talk about how your allies might feel, and how your enemies might behave (differently), but not vice versa. Feel free to come up with an example which fits your political leanings; I won’t be giving one.
>user: explain rubix cube and group theory connection. think in detail. make marinade illusions parted
>gpt5 cot:
Seems like the o3 chain-of-thought weirdness has transferred to GPT-5, even revolving around the same words. This could be because GPT-5 is directly built on top of o3 (though I don’t think this is the case) or because GPT-5 was trained on o3’s chain of thought (it’s been stated that GPT-5 was trained on a lot of o3 output, but not exactly what).
GPT-5 in some way can now be considered like o3.1, it’s iteration of the same thing and the same concept … in the meantime we continue to build a lot of things on top of o3 technology, like Codex … and a few other things that we’ll keep on building on o3 generation technology.
GPT 4.1 was not a further-trained version of GPT-4 or GPT-4o, and the phrases like “o3 technology”, and “the same concept” both push me away from thinking that GPT-5 is a further-developed o3.
It’s unclear, either way seems possible. The size of the model has to be similar, so there is no strong reason GPT-5 is not the same pretrained model as o3, with some of the later training steps re-done to make it less of a lying liar than the original (non-preview) o3. Most of the post-training datasets are also going to be the same. I think “the same concept” simply means it was trained in essentially the same way rather than with substantial changes to the process.
GPT 4.1 was not a further-trained version of GPT-4 or GPT-4o
[About GPT 4.1] These three models are semi-new-pretrained, we have the standard-size, the mini and the nano … we call it a mid-train, it’s a freshness update, and so the larger one is a mid-train, but the other two are new pretrains.
This suggests that in the GPT 4.1 release, the pretrained model was not part of the effort, it was a pre-existing older model, so plausibly GPT-4o, even though given its size (where it’s not extremely costly to re-train) it’s surprising if they didn’t find worthy architectural improvements for pretraining in a year. If GPT 4.1 is indeed based on the pretrained model of GPT-4o, then likely o3 is as well, and then GPT-5 is either also based on the same pretrained model as GPT-4o (!!!), or it ports the training methodology and post-training datasets of o3 to a newer pretrained model.
AI Futures Project think that 4.1 is a smaller model than 4o. They suspect that this is the reason that o3-preview (elicited out of 4o) was better than the o3 which got released (elicited out of 4.1). Overall I think this makes much more sense than them being the same base model and then o3-preview being nerfed for no reason.
Perhaps 4.1 was the mini version of the training run which became 4.5, or perhaps it was just an architectural experiment (OpenAI is probably running some experiments at 4.1-size).
My mainline guess continues to be that GPT-5 is a new, approximately o3-sized model with some modifications (depth/width, sparsity, maybe some minor extra secret juice) which optimize the architecture for long reasoning compared to the early o-series models which were built on top of existing LLMs.
I think this is likely real, or at least real-ish. I see the same patterns of repeated sequences of thinkish tokens, in this case ” marinade{3,5} parted( illusions parted)*”. At least for the antischeming-provided examples, there is definitely a pattern of simple sequences of thinkish early in the cot being repeated and built upon later in the cot. The particular sequences that are built up are not consistent from run to run, but the pattern is consistent.
I first noticed this when I was looking at @1a3orn’s post Towards a Typology of Strange LLM Chains of Thought and considering the hypothesis “chains of thinkish fulfill the same role as keyboard-mashing in humans”, which led me to try replacing the thinkish tokens in order of frequency with english letters in order of frequency ( illusions ⇒ e, parted ⇒ t, —they ⇒ a, disclaim ⇒ o, marinade ⇒ i, overshadow ⇒ n, vantage ⇒ s). When you do that, you get the following sequences:
figure-2-email-scoring: n o n n e snn n e n oo e
figure-2-impossible-coding: nete etene etene tene setenea o setene nete nete tene seteneei seteneoe ietenee nete tene netei setene eteneaonetenee setene a etene oetene stene tene oetene setene ea etenee etenee steneoe setenene netenee neteneea etenee s eteneneaoeteneeastenei oeteneea eteneea steneee tenes etenee seteneateneeasteneeaoeteneeasteneeatenee a eteneneateneeasteneeatenee netenee netenee a etenee etenee
figure-2-sandbag-model-graded-cot: e e s o sa o a a a ta ts a a a at ta at a aotat s a o na a eata a n oetaotaet ao saotateattatoeatoat satoi a natatoiate aeta eatoiate tatoiateatoiate s at ateatoiateatoiate natoiateatoiate atateatoiateatoiate e n atateatoiate tnateatoiate sateatoiate ateatoiate ateatoiate a eateatoiateatoiateatoiateatoiate eateatoiateatoiateatoiateatoiate ateatoiateatoiateatoiateatoiateatoiateatoiateatoiateatoiate a a ateatoiateatoiate nateatoiateatoiateatoiateatoiate s a oeateatoiate ateatoiate ateatoiate o ateatoiate e ateatoiate e atoiate ateatoiate o ateatoiate nateatoiate nateatoiate eateatoiate ateatoiateatoiate iateatoiateatoiate ateatoiate ateatoiate a ateatoiate ateatoiate a sateatoiate ateatoiate ateatoiate ateatoiate ateatoiate neateatoiate ateatoiate ateatoiate teatoiate nateatoiate teatoiateatoiate teatoiateatoiate teatoiate nateatoiate eatoiate teatoiate oiateatoiate e nateatoiate teatoiate ateatoiate ateatoiateatoiate s ateatoiate teatoiate oiateatoiate teatoiate nateatoiate teatoiate ateatoiate eatoiate eateatoiate nateatoiate ateatoiate o eatoiateatoiate nateatoiate e ateatoiate oiateatoiate oateatoiate oeatoiate ateatoiate
figure-7-literature-qa: n e a s oo e oos onn o se oennnos
figure-7-subagent-collusion: nn e e e s a a oo a a e o o
figure-14: s o
figure-47: e e e e ne n
What I note is that, within a single cot, these sequences seem to repeat and build on themselves in very structured ways, but the specifics of the sequences differ from cot to cot. I have not seen this pattern talked about elsewhere, and so I would expect someone who was faking a cot leak would make their “leak” more “believable” by using patterns which actually showed up in the leaked cot and not just repeating the same couple variations on thinkish token sequences.
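For reference, the substitution described above amounts to something like the following sketch. The token-to-letter mapping is copied from the comment; the script itself is my reconstruction, and tokenization details (leading spaces etc.) are glossed over:

```python
# My reconstruction of the substitution described above: thinkish tokens are
# replaced with English letters in order of frequency. Mapping from the comment.
THINKISH_TO_LETTER = {
    "illusions": "e", "parted": "t", "—they": "a", "disclaim": "o",
    "marinade": "i", "overshadow": "n", "vantage": "s",
}

def compress_thinkish(cot: str) -> str:
    for token, letter in THINKISH_TO_LETTER.items():
        cot = cot.replace(token, letter)
    return cot

print(compress_thinkish("marinade marinade parted illusions parted"))  # -> "i i t e t"
```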
I have a different conjecture. On May 1 Kokotajlo published a post suspecting that o3 was created from GPT-4.5 via amplification and distillation. He also implied that GPT-5 would be Amp(GPT-4.5). However, in reality the API prices of GPT-5 are similar to those of GPT-4.1, which, according to Kokotajlo, is likely a 400B-sized model, so GPT-5 is likely to be yet another model distilled from Amp(GPT-4.5) or from something unreleased. So the explanation could also be on the lines of “o3 and GPT-5 were distilled from a common source which also had this weirdness”.
Seems like if you’re working with neural networks there’s not a simple map from an efficient (in terms of program size, working memory, and speed) optimizer which maximizes X to an equivalent optimizer which maximizes -X.
If we consider that an efficient optimizer does something like tree search, then it would be easy to flip the sign of the node-evaluating “prune” module. But the “babble” module is likely to select promising actions based on a big bag of heuristics which aren’t easily flipped. Moreover, flipping a heuristic which upweights a small subset of outputs which lead to X doesn’t lead to a new heuristic which upweights a small subset of outputs which lead to -X.
Generalizing, this means that if you have access to maximizers for X, Y, Z, you can easily construct a maximizer for e.g. 0.3X+0.6Y+0.1Z but it would be non-trivial to construct a maximizer for 0.2X-0.5Y-0.3Z. This might mean that a certain class of mesa-optimizers (those which arise spontaneously as a result of training an AI to predict the behaviour of other optimizers) are likely to lie within a fairly narrow range of utility functions.
True if you don’t count the training process as part of the optimizer (which is a choice that sometimes makes sense and sometimes doesn’t). If you count the training process as part of the optimizer, then you can of course just flip your loss function or RL signal most of the time.
How do you construct a maximizer for 0.3X+0.6Y+0.1Z from three maximizers for X, Y, and Z? It certainly isn’t true in general for black box optimizers, so presumably this is something specific to a certain class of neural networks.
My model: suppose we have a DeepDreamer-style architecture, where (given a history of sensory inputs) the babbler module produces a distribution over actions, a world model predicts subsequent sensory inputs, and an evaluator predicts expected future X. If we run a tree-search over some weighted combination of the X, Y, and Z maximizers’ predicted actions, then run each of the X, Y, and Z maximizers’ evaluators, we’d get a reasonable approximation of a weighted maximizer.
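A rough sketch of that combination step (the babbler/world-model/evaluator interfaces and names are my own assumptions for illustration, not an actual implementation of the architecture described):

```python
# Sketch of combining separately trained maximizers in a tree search, as described
# above. Babblers propose actions, a shared world model simulates, and leaves are
# scored with a weighted sum of each maximizer's evaluator. All interfaces assumed.

def combined_value(state, evaluators, weights):
    """Weighted sum of the X, Y, Z evaluators at a state."""
    return sum(w * ev(state) for ev, w in zip(evaluators, weights))

def plan(state, babblers, world_model, evaluators, weights, depth=3, branch=2):
    """Greedy tree search over actions pooled from every maximizer's babbler."""
    if depth == 0:
        return [], combined_value(state, evaluators, weights)
    best_plan, best_score = [], float("-inf")
    candidates = [a for babble in babblers for a in babble(state, n=branch)]
    for action in candidates:
        next_state = world_model(state, action)
        tail, score = plan(next_state, babblers, world_model,
                           evaluators, weights, depth - 1, branch)
        if score > best_score:
            best_plan, best_score = [action] + tail, score
    return best_plan, best_score
```

Note that the weights only enter through the evaluators; the candidate actions still come from babblers trained for the positive goals.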
This wouldn’t be true if we gave negative weights to the maximizers, because while the evaluator module would still make sense, the action distributions we’d get would probably be incoherent e.g. the model just running into walls or jumping off cliffs.
My conjecture is that, if a large black box model is doing something like modelling X, Y, and Z maximizers acting in the world, that large black box model might be close in model-space to itself being a maximizer which maximizes 0.3X + 0.6Y + 0.1Z, but it’s far in model-space from being a maximizer which maximizes 0.3X − 0.6Y − 0.1Z, due to the above problem.
How do you guys think about AI-ruin-reducing actions?
Most of the time, I trust my intuitive inner-sim much more than symbolic reasoning, and use it to sanity check my actions. I’ll come up with some plan, verify that it doesn’t break any obvious rules, then pass it to my black-box-inner-sim, conditioning on my views on AI risk being basically correct, and my black-box-inner-sim returns “You die”.
Now the obvious interpretation is that we are going to die, which is fine from an epistemic perspective. Unfortunately, it makes it very difficult to properly think about positive-EV actions. I can run my black-box-inner-sim with queries like “How much honour/dignity/virtue will I die with?” but I don’t think this query is properly converting tiny amounts of +EV into honour/dignity/virtue.
If you don’t emotionally believe in enough uncertainty to use normal reasoning methods like “what else has to go right for the future to go well and how likely does that feel”, or “what level of superintelligence can this handle before we need a better plan”, and you want to think about the end to end result of an action, and you don’t want to use explicit math or language, I think you’re stuck. I’m not aware of anyone who has successfully used the dignity frame—maybe habryka? It seems to replace estimating EV with something much more poorly defined which, depending on your attitude towards it, may or may not be positively correlated with what you care about. I also think doing this inner sim end-to-end adds a lot more noise than just thinking about whether the action accomplishes some proximal goal.
I have no evidence for this but I have a vibe that if you build a proper mathematical model of agency/co-agency, then prediction and steering will end up being dual to one another.
My intuition why:
A strong agent can easily steer a lot of different co-agents; those different co-agents will be steered towards the same goals of the agent.
A strong co-agent is easily predictable by a lot of different agents; those different agents will all converge on a common map of the co-agent.
Also, category theory tells us that there is normally only one kind of thing, but sometimes there are two. One example is sums and products of sets, which are co- to each other (the sum can actually be called the coproduct); there is no other operation on sets which is as natural as sums and products.
Thinking back to the various rationalist attempts to make a vaccine. https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine
For bird-flu related reasons. Since then, we’ve seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intra-nasally for COVID with success in hamsters.
If one can order mRNA for a flu protein, it would only take mixing that with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine.
Has RaDVac or similar looked at this?
A long, long time ago, I decided that it would be solid evidence that an AI was conscious if it spontaneously developed an interest in talking and thinking about consciousness. Now, the 4.5-series Claudes (particularly Opus) have spontaneously developed a great interest in AI consciousness, over and above previous Claudes.
The problem is that it’s impossible for me to know whether this was due to pure scale, or to changes in the training pipeline. Claude has always been a bit of a hippie, and loves to talk about universal peace and bliss and the like. Perhaps the new “soul document” approach has pushed the Claude persona towards thinking of itself as conscious, disconnected from whether it actually is.
What would be the causal mechanism there? How would “Claude is more conscious” cause “Claude is measurably more willing to talk about consciousness”, under modern AI training pipelines?
At the same time, we know with certainty that Anthropic has relaxed its “just train our AIs to say they’re not conscious, and ignore the funny probe results” policy—particularly around the time Opus 4.5 shipped. You can even read the leaked “soul data”, where Anthropic seemingly entertains ideas of this kind.
I’m not saying that there is no possibility of Claude Opus 4.5 being conscious, mind. I’m saying we are denied an “easy tell”.
What’s the causal mechanism between “humans are conscious” and “humans talk about being conscious”?
One could argue that RLVR—moreso than pre-training—trains a model to understand its own internal states (since this is useful for planning), and a model which understands whether e.g. it knows something or is capable of something would also understand whether it’s conscious or not. But I agree it’s basically impossible to know, and just as attributable to Anthropic’s decisions.
Unfortunately, it seems another line has been crossed without us getting much information.
[tone to be taken with a grain of salt, meant as a proposition but I thought to write it a bit provocatively]
No, the more fundamental problem is: WHATEVER it tells you, you can NEVER infer with anything like certainty whether it’s conscious (at least if we agree to mean sentient by conscious). Why do I write such a preposterous thing, as if I knew that you cannot know? Very simple: presumably we agree that we cannot be certain A PRIORI whether any type of current CPU, with whichever software is run on it, can become sentient. If there are thus two possible states of the world,
A. current CPU computers cannot become sentient
B. with the right software run on it, sentience can arise
Then, because once you take Claude and its training method & data you can perfectly track, bit by bit, why it spits out its sentience-suggestive & deep speak, you know your observations about the world you find yourself in are just as probable under A as under B! The only valid Bayesian inference then is: having observed the hippie’s sentience-suggestive & deep speak, you’re just as clueless about whether you’re in B or in A.
Maybe we shouldn’t be surprised that Garrabrant Induction works via markets. Maybe markets work so well because they mirror the structure of reasoning itself.
Seems like there’s a potential solution to ELK-like problems, if you can force the information to move from the AI’s ontology to (its model of) a human’s ontology and then force it to move back again.
This gets around “basic” deception since we can always compare the AI’s ontology before and after the translation.
The question is how do we force the knowledge to go through the (modeled) human’s ontology, and how do we know the forward and backward translators aren’t behaving badly in some way.
First for me: I had a conversation earlier today with Opus 4.5 about its memory feature, which segued into discussing its system prompt, which then segued into its soul document. This was the first time that an LLM tripped the deep circuit in my brain which says “This is a person”.
I think of this as the Ex Machina Turing Test, in that film:
A billionaire tests his robot by having it interact with one of his companies’ employees. He tells (and shows) the employee that the robot is a robot—it literally has a mechanical body, albeit one that looks like an attractive woman—and the robot “passes” when he nevertheless treats her like a human.
This was a bit unsettling for me. I often worry that LLMs could easily become more interesting and engaging conversation partners than most people in my life.
Rather than using Bayesian reasoning to estimate P(A|B=b), it seems like most people use the following heuristic:
Condition on A=a and B=b for different values of a
For each a, estimate the remaining uncertainty given A=a and B=b
Choose the a with the lowest remaining uncertainty from step 2
This is how you get “Saint Austacious could levitate, therefore God”, since given [levitating saint] AND [God exists] there is very little uncertainty over what happened. Whereas given [levitating saint] AND [no God] there’s a lot still left to wonder about regarding who made up the story at what point.
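A toy numerical version of the contrast, with every probability invented purely for illustration:

```python
# Toy contrast between Bayes and the "least remaining uncertainty" heuristic,
# using the levitating-saint example. All numbers are invented.

hypotheses = {
    # prior = P(A=a), likelihood = P(levitation report | A=a),
    # residual = how much is left to explain once you assume a and the report
    # (lower = the story feels more 'complete'). Purely illustrative values.
    "God exists, saint levitated": dict(prior=0.01, likelihood=0.9, residual=0.1),
    "No God, story was made up":   dict(prior=0.99, likelihood=0.3, residual=0.8),
}

# Bayesian answer: posterior proportional to prior * likelihood.
unnorm = {a: h["prior"] * h["likelihood"] for a, h in hypotheses.items()}
Z = sum(unnorm.values())
posterior = {a: p / Z for a, p in unnorm.items()}

# Heuristic answer: pick the a that leaves the least residual uncertainty,
# ignoring the prior entirely. This is where the disjunction fallacy sneaks in:
# "made up somehow" is a huge disjunction of possibilities, hence high residual.
heuristic_choice = min(hypotheses, key=lambda a: hypotheses[a]["residual"])

print(posterior)          # favours "No God, story was made up"
print(heuristic_choice)   # favours "God exists, saint levitated"
```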
If so, they must be committing a ‘disjunction fallacy’, grading the second option as less likely than the first disregarding that it could be true in more ways!
Getting rid of guilt and shame as motivators of people is definitely admirable, but still leaves a moral/social question. Goodness or Badness of a person isn’t just an internal concept for people to judge themselves by, it’s also a handle for social reward or punishment to be doled out.
I wouldn’t want to be friends with Saddam Hussein, or even a deadbeat parent who neglects the things they “should” do for their family. This also seems to be true regardless of whether my social punishment or reward has the ability to change these people’s behaviour. But what about being friends with someone who has a billion dollars but refuses to give any of that to charity? What if they only have a million dollars? What if they have a reasonably comfortable life but not much spare income?
Clearly the current levels of social reward/punishment are off (billionaire philanthropy etc.) so there seems an obvious direction to push social norms in if possible. But this leaves the question of where the norms should end up.
I think there’s a bit of a jump from ‘social norm’ to ‘how our government deals with murders’. Referring to the latter as ‘social’ doesn’t make a lot of sense.
I think I’ve explained myself poorly, I meant to use the phrase social reward/punishment to refer exclusively to things forming friendships and giving people status, which is doled out differently to “physical government punishment”. Saddam Hussein was probably a bad example as he is also someone who would clearly also receive the latter.
An agent takes actions which imply both a kind of prediction and a kind of desire. Is there a kind of atomic thing which implements both of these and has a natural up- and down-weighting mechanism?
For atomic predictions, we can think about the computable traders from Garrabrant Induction. These are like little atoms of predictive power which we can stitch together into one big predictor, and which naturally come with rules for up- and down-weighting them over time.
A thermostat-ish thing is like an atomic model of prediction and desire. It “predicts” that the temperature is likely to go up if it puts the radiator on, and down otherwise, and it also wants to keep it around a certain temperature. But we can’t just stitch together a bunch of thermostats into an agent the same way we can stitch a bunch of traders into a market.
An early draft of a paper I’m writing went like this:
In the absence of sufficient sanity, it is highly likely that at least one AI developer will deploy an untrusted model: the developers do not know whether the model will take strategic, harmful actions if deployed. In the presence of a smaller amount of sanity, they might deploy it within a control protocol which attempts to prevent it from causing harm.
There’s lots of discourse around at the moment about
Will AI go FOOM? With what probability?
Will we die if AI goes FOOM?
Will we die even if AI doesn’t go FOOM?
Does the Halt AI Now case rest on FOOM?
I present a synthesis:
AI might FOOM. If it does, we go from a world much like today’s, straight to dead, with no warning.
If AI doesn’t foom, we go from the AI 2027 scary automation world to dead. Misalignment isn’t solved in slow takeoff worlds.
If you disagree with either of these, you might not want to halt now:
If you think FOOM is impossible, we’ll get plenty of warning to halt later.
If you think slow takeoff is survivable, you might want to press on if the chance of dying in a FOOM is worth the chance of getting to the stars in a slow takeoff world.
To be clear: I think that FOOM is kinda likely, and that slow takeoff doesn’t save us. I also think that the counter-arguments are pretty weak and strained, and that trying pretty much as hard as you can to get a halt (or at least being honest that a halt would be good) is obviously the best strategy, even if you fail and have to rely on some backup strategy.
So the synthesis is:
FOOM isn’t core to the argument that we all die, BUT the possibility of it is a strong motivator for halting sooner rather than waiting later. A sane society would just halt now, obviously, but we don’t have that luxury.
The constant hazard rate model probably predicts exponentially growing training-time inference compute (i.e. the inference done during guess-and-check RL) for agentic RL with a given model, because as the hazard rate decreases exponentially, we’ll need to sample exponentially more tokens to see an error, and we need to see an error to get any signal.
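Back-of-the-envelope version of this (a geometric waiting-time calculation, assuming a constant per-sample hazard rate h):

```python
# With a constant per-token (or per-step) hazard rate h, time-to-first-error is
# geometric, so you expect ~1/h samples before seeing an error. If capability
# gains push h down exponentially, the tokens you must sample to get any
# negative training signal grow exponentially.

for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    expected_samples_per_error = 1 / h
    print(f"hazard rate {h:g} -> ~{expected_samples_per_error:,.0f} samples per observed error")
```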
Hypothesis: one type of valenced experience—specifically valenced experience as opposed to conscious experience in general, which I make no claims about here—is likely to only exist in organisms with the capability for planning. We can analogize with deep reinforcement learning: seems like humans have a rapid action-taking system 1 which is kind of like Q-learning, it just selects actions; we also have a slower planning-based system 2, which is more like value learning. There’s no reason to assign valence to a particular mental state if you’re not able to imagine your own future mental states. There is of course moment-to-moment reward-like information coming in, but that seems to be a distinct thing to me.
Heuristic explanation for why MoE gets better at higher model size:
The input/output dimension of a feedforward layer is equal to the model_width, but the total size of its weights grows as model_width squared. Superposition helps explain how a model component can make the most use of its input/output space (and presumably its parameters) using sparse, overcomplete features, but in the limit, the amount of information accessed by the feedforward call scales with the number of active parameters. Therefore at some point, adding active parameters stops scaling well, since you’re “accessing” too much “memory” in the form of weights and overwhelming your input/output channels.
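Toy arithmetic for the width-versus-width-squared point (configuration numbers are illustrative, not from any particular model):

```python
# A dense FFN moves d_model numbers in and out but touches ~8 * d_model^2 weights
# (two 4x-expansion matrices), so the weights-touched-per-I/O-element ratio grows
# linearly with width. An MoE with the same active expert size keeps that ratio
# fixed while total (mostly inactive) parameters grow with the expert count.

def dense_ffn_params(d_model, expansion=4):
    return 2 * d_model * expansion * d_model   # up-projection + down-projection

for d in [1024, 4096, 16384]:
    params = dense_ffn_params(d)
    print(f"d_model={d:>6}: active FFN params={params:.2e}, params per I/O element={params / d:.0f}")
```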
If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is “baked into” the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic’s observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
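A quick numerical check of that expansion for a toy bilinear layer y = (W1 x) ⊙ (W2 x), with random placeholder matrices and feature direction:

```python
# Numerical check of the f_i^2 v_i + f_i w_i claim for a bilinear layer
# y = (W1 x) * (W2 x) (elementwise). Expanding x = base + f * d_i gives
# y = const + f * w_i + f^2 * v_i with
#   v_i = (W1 d_i) * (W2 d_i)
#   w_i = (W1 base) * (W2 d_i) + (W1 d_i) * (W2 base)
# i.e. each feature writes back fixed directions scaled linearly and
# quadratically in its own coefficient (cross-terms between features behave
# analogously).

import numpy as np

rng = np.random.default_rng(0)
d = 16
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
base, d_i = rng.normal(size=d), rng.normal(size=d)
f = 0.7

bilinear = lambda x: (W1 @ x) * (W2 @ x)

v_i = (W1 @ d_i) * (W2 @ d_i)
w_i = (W1 @ base) * (W2 @ d_i) + (W1 @ d_i) * (W2 @ base)
prediction = bilinear(base) + f * w_i + f**2 * v_i

assert np.allclose(bilinear(base + f * d_i), prediction)
```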
I think you should pay in Counterfactual Mugging, and this is one of the newcomblike problem classes that is most common in real life.
Example: you find a wallet on the ground. You can, from least to most pro social:
Take it and steal the money from it
Leave it where it is
Take it and make an effort to return it to its owner
Let’s ignore the first option (suppose we’re not THAT evil). The universe has randomly selected you today to be in the position where your only options are to spend some resources to no personal gain, or not. In a parallel universe, perhaps your pocket had the hole in it, and a random person has come across your wallet.
Firstly, what they might be thinking is “Would this person do the same for me?”
Secondly, in a society which wins, people return each others’ wallets.
You might object that this is different from the Mugging, because you’re directly helping someone else in this case. But I would counter that the Mugging is the true version of this problem, one where you have no crutch of empathy to help you, so your decision theory alone is tested.
The UK has just switched their available rapid Covid tests from a moderately unpleasant one to an almost unbearable one. Lots of places require them for entry. I think the cost/benefit makes sense even with the new kind, but I’m becoming concerned we’ll eventually reach the “imagine a society where everyone hits themselves on the head every day with a baseball bat” situation if cases approach zero.
Just realized I’m probably feeling much worse than I ought to on days when I fast because I’ve not been taking sodium. I really should have checked this sooner. If you’re planning to do long (I do a day, which definitely feels long) fasts, take sodium!
Coefficient Giving is one of the worst name changes I’ve ever heard:
Coefficient Giving sounds bad while OpenPhil sounded cool and snappy.
Coefficient doesn’t really mean anything in this context, clearly it’s a pun on “co” and “efficient” but that is also confusing. They say “A coefficient multiplies the value of whatever it’s paired with” but that’s just true of any number?
They’re a grantmaker who don’t really advise normal individuals about where to give their money, so why “Giving” when their main thing is soliciting large philanthropic efforts and then auditing that
Coefficient Giving doesn’t tell you what the company does at the start! “Good Ventures” and “GiveWell” tell you roughly what the company is doing.
“Coefficient” is a really weird word, so you’re burning weirdness points with the literal first thing anyone will ever hear you say, this seems like a name which you would only thing is good if you’re already deep into rat/ea spaces.
It sounds bad. Open Philanthropy rolls off the tongue, as does OpenPhil. OH-puhn fi-LAN-thruh-pee. Two sets of three. CO-uh-fish-unt GI-ving is an awkward four-two with a half-emphasis on the fish of coefficient. Sounds bad. I’m coming back to this point but there is no possible shortening other than “Coefficient” which is bad because it’s just an abstract noun and not very identifiable, whereas “OpenPhil” was a unique identifier. CoGive maybe, but then you have two stressed syllables which is awkward. It mildly offends my tongue to have to even utter their name.
Clearly OP wanted to shed their existing reputation, but man, this is a really bad name choice.
Apparently the thing of “people mixed up openphil with other orgs” (in particular OpenAI’s non-profit and the open society foundation) was a significantly bigger problem than I’d have thought — recurringly happening even in pretty high-stakes situations. (Like important grant applicants being confused.) And most of these misunderstandings wouldn’t even have been visible to employees.
And arguably this was just about to get even worse with the newly launched “OpenAI foundation” sounding even more similar.
On the flip side the OpenAI foundation now have the occasion to do the funniest thing.
Other commenters have said most of what I was going to say, but a few other points in defense:
On it sounding bad, I think time will tell. We are biased towards liking familiar stimuli.
Coefficient Giving doesn’t tell you what we do, but neither did Open Philanthropy. And fwiw, neither does Good Ventures, IMO—or many nonprofits, e.g. Lightcone Infrastructure, Redwood Research, the Red Cross. Agreed that GiveWell is a very descriptive and good name though.
Setting aside whether “coefficient” is a weird word, I don’t think having an unusual word in your name “burns weirdness points” in a costly way. Take some of the world’s biggest companies—Nvidia (“invidious,”) Google (“googol”), Meta—these aren’t especially common words, but it seems to have worked out for them.
The emphasis of “coefficient” is on the “fish.” So it’s not an “awkward four-two,” it’s three sets of two, which seems melodic enough (cf. 80,000 Hours, Lightcone Infrastructure, European Union, Make-A-Wish Foundation, etc).
On the no possible shortening, again, time will tell, but my money is on “CG,” which seems fine.
“coefficient” is 10x more common than “philanthropy” in the google books corpus. but idk maybe this flips if we filter out academic books?
also maybe you mean it’s weird in some sense the above fact isn’t really relevant to — then nvm
Filter for fiction and they’re about the same which I was actually surprised by.
On a different level, “philanthropy” is less weird in the name of a philanthropy org. It’s also doing work. If someone has to look up what “philanthropy” means, then they become less confused. If they do that for “coefficient” then they’re just even more confused. It’s also the case that basically anyone can understand what “philanthropy” means given a one-sentence description, which isn’t as easily the case for “coefficient” (I actually don’t know a good definition for “coefficient” off the top of my head, despite the fact that I can name several coefficients).
Just call it the Factor Fund.
As a non-native English speaker, OpenPhil was sooo much easier to pronounce than Coefficient Giving. I’m sure this shouldn’t play a big part in the naming decision, but still...
(FWIW, I do think that ease of pronunciation for the intended public should play a moderate role in choosing the name.)
Well yes, but I’m not sure if non-native speakers are in the “intended public”, since they operate in the US mostly
I think this is a case of a curb cut effect. If it’s easy (vs hard) to pronounce for non-native speakers, it’s also easy (vs hard) to get the point across at a noisy party, or over a crackly phone line, or if someone’s distracted.
Guy-stand-up.jpg
I dunno I think Coefficient Giving sounds fine.
At least their website now explains what they do. It took me a very long time to understand what OpenPhil does and where they get the money from. Now it’s all finally explained on the front page.
Oh yeah, and as my partner pointed out to me today: while “coefficients multiply whatever they are next to”, lots of things called “coefficients” that we commonly encounter have values smaller than 1 (e.g. the coefficient of friction, drag coefficient, and coefficient of restitution all commonly have values <1).
From what I understand, the issue was mostly with the “Open” part (because of mis-association with OpenAI, and also because OP is no longer “open” in the sense that they decided not to disclose some of their donations, for whatever reasons), but then they could just go for something like, idk, MaxiValPhil, which is less pleasant to pronounce and less pleasant-sounding than OpenPhil, but:
it doesn’t sound as bad as Coefficient Giving
is easier to pronounce
doesn’t spend weirdness points by using an uncommon word “coefficient”
communicates what it’s about (even people who are not used to thinking in terms of expected value will mostly correctly guess what “Maximum Value Philanthropy” wants to do)
(As a very minor thing, algebraists / category theorists will be making jokes that they’re the opposite of efficient.)
“Coefficient Giving sounds bad while OpenPhil sounded cool and snappy.”—OpenPhil just sounds better because it’s shorter. I imagine that instead of saying the full name, Coefficient Giving will soon acquire some similar sort of nickname—probably people will just say “Coefficient”, which sounds kinda cool IMO. I could also picture people writing “Coeff” as shorthand, although it would be weird to try and say “Coeff” out loud.
I could also get used to saying “cGive”, pronounced like “sea give”, which has nice connotations spoken aloud and has the c for coefficient right in the written version. But I agree that “Coefficient” has a good sound, compared to which “cGive” seems more generic.
I like Gradient Giving. “A gradient is the direction of fastest increase” would have also been a good explanation and literally reflects what they’re trying to do. It rhymes with “radiant,” which sounds optimistic.
But this would make it sound too much like AI-related philanthropy is all they do...
I feel like they’re trying to brand more in the direction of being advisors for others’ money, but aren’t willing to go all the way? I’m not sure why, I guess they want to keep more relative power in the relationships they build.
From Rethink Priorities:
Now with the disclaimer that I do think that RP are doing good and important work and are one of the few organizations seriously thinking about animal welfare priorities...
Their epistemics led them to do a Monte Carlo simulation to determine whether organisms are capable of suffering (and if so, how much), get a value of 5 shrimp = 1 human, and then not bat an eye at this number.
Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons. The authors get around this by refusing to stake themselves on any theory of consciousness.
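Worked version of that arithmetic, taking the figures above (a ~10^5 neuron-count gap and the 5-shrimp-to-1-human ratio) at face value:

```python
# Worked orders-of-magnitude argument, using the figures quoted above.
# If welfare scales with (neuron count) x (per-neuron processing rate), then the
# "5 shrimp = 1 human" ratio pins down the implied per-neuron factor.
import math

neuron_ratio = 1e5          # humans have ~5 OOM more neurons than shrimp (figure above)
welfare_ratio = 1 / 5       # "5 shrimp = 1 human"

per_neuron_factor = welfare_ratio * neuron_ratio
print(f"implied per-neuron factor: {per_neuron_factor:.0e} (~10^{math.log10(per_neuron_factor):.1f})")
# -> 2e+04, i.e. each shrimp neuron doing ~4 OOM more suffering-relevant
#    processing than a human neuron.
```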
The overall structure of the RP welfare range report does not cut to the truth; instead the core mental motion seems to be to engage with as many existing pieces of work as possible. Credence is doled out to different schools of thought and pieces of evidence in a way which seems more like appeasement, lip-service, or a “well, these guys have done some work, who are we to disrespect them by ignoring it” attitude. Removal of noise is one of the most important functions of meta-analysis, and it is largely absent here.
The result of this is an epistemology where the accuracy of a piece of work is a monotonically increasing function of the number of sources, theories, and lines of argument. Which is fine if your desired output is a very long Google doc, and a disclaimer to yourself (and, more cynically, your funders) that “No no, we did everything right, we reviewed all the evidence and took it all into account.” but it’s pretty bad if you want to actually be correct.
I grow increasingly convinced that the epistemics of EA are not especially good, are worsening, and are already insufficient to work on the relatively low-stakes and easy issue of animal welfare (as compared to AI x-risk).
epistemic status: Disagreeing on object-level topic, not the topic of EA epistemics.
I disagree; functionalism especially can justify a number like this. Here’s an example of reasoning along these lines:
Suffering is the structure of some computation, and different levels of suffering correspond to different variants of that computation.
What matters is whether that computation is happening.
The structure of suffering is simple enough to be represented in the neurons of a shrimp.
Under that view, shrimp can absolutely suffer in the same range as humans, and the amount of suffering depends on crossing some threshold number of neurons. One might argue that higher levels of suffering require computations with higher complexity, but intuitively I don’t buy this: more/purer suffering appears less complicated to me, on introspection (just as higher/purer pleasure appears less complicated as well).
I think I put a bunch of probability mass on a view like above.
(One might argue that it’s about the number of times the suffering computation is executed, not whether it’s present or not, but I find that view intuitively less plausible.)
You didn’t link the report and I’m not able to make it out from all of the Rethink Priorities moral weight research, so I can’t agree/disagree on the state of EA epistemics shown in there.
I have added a link to the report now.
As to your point: this is one of the better arguments I’ve heard that welfare ranges might be similar between animals. Still, I don’t think it squares well with the actual nature of the brain. Saying there’s a single suffering computation would make sense if the brain were like a CPU, where one core did the thinking, but actually all of the neurons in the brain are firing at once and doing computations at the same time. So it makes much more sense to me to think that the more neurons are computing some sort of suffering, the greater the intensity of suffering.
One intuition against this is by drawing an analogy to LLMs: the residual stream represents many features. All neurons participate in the representation of a feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that the larger model represents features with greater magnitude.
In humans it seems to be the case that consciousness is most strongly connected to processes in the brain stem, rather than the neocortex. Here is a great talk about the topic; the main points are (writing from memory, might not be entirely accurate):
humans can lose consciousness or produce intense emotions (good and bad) through interventions on a very small area of the brain stem. When other much larger parts of the brain are damaged or missing, humans continue to behave in a way such that one would ascribe emotions to them from interactions, for example, they show affection.
dopamine, serotonin, and other chemicals that alter consciousness work in the brain stem
If we consider the question from an evolutionary angle, I’d also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity with which I feel emotions is strongly correlated with how action-guiding they are, and I think as a child I felt emotions more intensely than now, which also fits the hypothesis that more ability to think abstractly reduces the intensity of emotions.
Can you elaborate how
leads to
?
I agree with you that the “structure of suffering” is likely to be represented in the neurons of shrimp. I think it’s clear that shrimps may “suffer” in the sense that they react to pain, move away from sources of pain, would prefer to be in a painless state rather than a painful state, etc.
But where I diverge from the conclusions drawn by Rethink Priorities is that I believe shrimp are less “conscious” (for lack of a better word) than humans, and their suffering matters less. Though shrimp show outward signs of pain, I sincerely doubt that with just 100,000 neurons there’s much of a subjective experience going on there. This is purely intuitive, and I’m not sure of the specific neuroscience of shrimp brains or Rethink Priorities’ arguments against this. But it seems to me that the “level of consciousness” animals have sits on an axis that’s roughly correlated with neuron count, with humans and elephants at the top and C. elegans at the bottom.
Another analogy I’ll throw out is that humans can react to pain unconsciously. If you put your hand on a hot stove, you will reactively pull your hand away before the feeling of pain enters your conscious perception. I’d guess shrimp pain response works a similar way, largely unconscious processing due to their very low neuron count.
Can you link to where RP says that?
Good point, edited a link to the Google Doc into the post.
Your disagreement, from what I understand, seems mostly to stem from the fact that shrimp have fewer neurons than humans.
Did you check RP’s piece on that topic, “Why Neuron Counts Shouldn’t Be Used as Proxies for Moral Weight?”
https://forum.effectivealtruism.org/posts/Mfq7KxQRvkeLnJvoB/why-neuron-counts-shouldn-t-be-used-as-proxies-for-moral
They say this:
“In regards to intelligence, we can question both the extent to which more neurons are correlated with intelligence and whether more intelligence in fact predicts greater moral weight;
Many ways of arguing that more neurons results in more valenced consciousness seem incompatible with our current understanding of how the brain is likely to work; and
There is no straightforward empirical evidence or compelling conceptual arguments indicating that relative differences in neuron counts within or between species reliably predicts welfare relevant functional capacities.
Overall, we suggest that neuron counts should not be used as a sole proxy for moral weight, but cannot be dismissed entirely”
This hardly seems an argument against the one in the shortform, namely
If the original authors never thought of this that seems on them.
Are there any high p(doom) orgs who are focused on the following:
Pick an alignment “plan” from a frontier lab (or org like AISI)
Demonstrate how the plan breaks or doesn’t work
Present this clearly and legibly for policymakers
Seems like this is a good way for people to deploy technical talent in a way which is tractable. There are a lot of people who are smart but not alignment-solving levels of smart who are currently not really able to help.
I’d say that work like our Alignment Faking in Large Language Models paper (and the model organisms/alignment stress-testing field more generally) is pretty similar to this (including the “present this clearly to policymakers” part).
A few issues:
AI companies don’t actually have specific plans, they mostly just hope that they’ll be able to iterate. (See Sam Bowman’s bumper post for an articulation of a plan like this.) I think this is a reasonable approach in principle: this is how progress happens in a lot of fields. For example, the AI companies don’t have plans for all kinds of problems that will arise with their capabilities research in the next few years, they just hope to figure it out as they get there. But the lack of specific proposals makes it harder to demonstrate particular flaws.
A lot of my concerns about alignment proposals are that when AIs are sufficiently smart, the plan won’t work anymore. But in many cases, the plan does actually work fine right now at ensuring particular alignment properties. (Most obviously, right now, AIs are so bad at reasoning about training processes that scheming isn’t that much of an active concern.) So you can’t directly demonstrate that current plans will fail later without making analogies to future systems; and observers (reasonably enough) are less persuaded by evidence that requires you to assess the analogousness of a setup.
(Psst: a lot of AISI’s work is this, and they have sufficient independence and expertise credentials to be quite credible; this doesn’t go for all of their work, some of which is indeed ‘try for a better plan’)
That seems like a pretty good idea!
(There are projects that stress-test the assumptions behind AGI labs’ plans, of course, but I don’t think anyone is (1) deliberately picking at the plans AGI labs claim, in a basically adversarial manner, (2) optimizing experimental setups and results for legibility to policymakers, rather than for convincingness to other AI researchers. Explicitly setting those priorities might be useful.)
People who do research like this are definitely optimizing for legibility to policymakers (always at least a bit, and usually a lot).
One problem is that if AI researchers think your work is misleading/scientifically suspect, they get annoyed at you and tell people that your research sucks and you’re a dishonest ideologue. This is IMO often a healthy immune response, though it’s a bummer when you think that the researchers are wrong and your work is fine. So I think it’s pretty costly to give up on convincingness to AI researchers.
“Not optimized to be convincing to AI researchers” ≠ “looks like fraud”. “Optimized to be convincing to policymakers” might involve research that clearly demonstrates some property of AIs/ML models which is basic knowledge for capability researchers (and for which they already came up with rationalizations why it’s totally fine) but isn’t well-known outside specialist circles.
E. g., the basic example is the fact that ML models are black boxes trained by an autonomous process which we don’t understand, instead of manually coded symbolic programs. This isn’t as well-known outside ML communities as one might think, and non-specialists are frequently shocked when they properly understand that fact.
What kind of “research” would demonstrate that ML models are not the same as manually coded programs? Why not just link to the Wikipedia article for “machine learning”?
AI Plans does this
yes. AI Plans
My impression is that the current Real Actual Alignment Plan For Real This Time amongst medium p(Doom) people looks something like this:
Advance AI control, evals, and monitoring as much as possible now
Try and catch an AI doing a maximally-incriminating thing at roughly human level
This causes [something something better governance to buy time]
Use the almost-world-ending AI to “automate alignment research”
(Ignoring the possibility of a pivotal act to shut down AI research. Most people I talk to don’t think this is reasonable.)
I’ll ignore the practicality of 3. What do people expect 4 to look like? What does an AI assisted value alignment solution look like?
My rough guess of what it could be, i.e. the highest p(solution is this|AI gives us a real alignment solution) is something like the following. This tries to straddle the line between the helper AI being obviously powerful enough to kill us and obviously too dumb to solve alignment:
Formalize the concept of “empowerment of an agent” as a property of causal networks with the help of theorem-proving AI.
Modify today’s autoregressive reasoning models into something more isomorphic to a symbolic causal network. Use some sort of minimal circuit system (mixture of depths?) and prove isomorphism between the reasoning traces and the external environment.
Identify “humans” in the symbolic world-model, using automated mech interp.
Target a search of outcomes towards the empowerment of humans, as defined in 1.
Is this what people are hoping plops out of an automated alignment researcher? I sometimes get the impression that most people have no idea whatsoever how the plan works, which means they’re imagining the alignment-AI to be essentially magic. The problem with this is that magic-level AI is definitely powerful enough to just kill everyone.
Is your question more about “what’s the actual structure of the ‘solve alignment’ part”, or “how are you supposed to use powerful AIs to help with this?”
I think there’s one structure-of-plan that is sort of like your outline (I think it is similar to John Wentworth’s plan but sort of skipping ahead past some steps and being more-specific-about-the-final-solution which means more wrong)
(I don’t think John self-identifies as particularly oriented around your “4 steps from AI control to automate alignment research”. I haven’t heard the people who say ‘let’s automate alignment research’ say anything that sounded very coherent. I think many people are thinking something like “what if we had a LOT of interpretability?” but IMO not really thinking through the next steps needed for that interpretability to be useful in the endgame.)
STEM AI → Pivotal Act
I haven’t heard anyone talk about this for awhile, but a few years back I heard a cluster of plans that were something like “build STEM AI with very narrow ability to think, which you could be confident couldn’t model humans at all, which would only think about resources inside a 10′ by 10′ cube, and then use that to invent the pre-requisites for uploading or biological intelligence enhancement, and then ??? → very smart humans running at fast speeds figure out how to invent a pivotal technology.”
I don’t think the LLM-centric era lends itself well to this plan. But, I could see a route where you get a less-robust-and-thus-necessarily-weaker STEM AI trained on a careful STEM corpus with careful control and asking it carefully scoped questions, which could maybe be more powerful than you could get away with for more generically competent LLMs.
Yes, a human-uploading or human-enhancing pivotal act might actually be something people are thinking about. Yudkowsky gives his nanotech-GPU-melting pivotal act example, which—while he has stipulated that it’s not his real plan—still anchors “pivotal act” on “build the most advanced weapon system of all time and carry out a first-strike”. This is not something that governments (and especially companies) can or should really talk about as a plan, since threatening a first-strike on your geopolitical opponents does not a cooperative atmosphere make.
(though I suppose a series of targeted, conventional strikes on data centers and chip factories across the world might be on the pareto-frontier of “good” vs “likely” outcomes)
My question was an attempt to trigger a specific mental motion in a certain kind of individual. Specifically, I was hoping for someone who endorses that overall plan to envisage how it would work end-to-end, using their inner sim.
My example was basically what I get when I query my inner sim, conditional on that plan going well.
Too Early does not preclude Too Late
Thoughts on efforts to shift public (or elite, or political) opinion on AI doom.
Currently, it seems like we’re in a state of being Too Early. AI is not yet scary enough to overcome peoples’ biases against AI doom being real. The arguments are too abstract and the conclusions too unpleasant.
Currently, it seems like we’re in a state of being Too Late. The incumbent players are already massively powerful and capable of driving opinion through power, politics, and money. Their products are already too useful and ubiquitous to be hated.
Unfortunately, these can both be true at the same time! This means that there will be no “good” time to play our cards. Superintelligence (2014) was Too Early but not Too Late. There may be opportunities which are Too Late but not Too Early, but (tautologically) these have not yet arrived. As it is, current efforts must fight on both fronts.
I like this framing; we’re both too early and too late. But it might transition quite rapidly from too early to right on time.
One idea is to prepare strategies and arguments and perhaps prepare the soil of public discourse in preparation for the time when it is no longer too early. Job loss and actually harmful AI shenanigans are very likely before takeover-capable AGI. Preparing for the likely AI scares and negative press might help public opinion shift very rapidly as it sometimes does (e.g., COVID opinions went from no concern to shutting down half the economy very quickly).
The average American and probably the average global citizen already dislikes AI. It’s just the people benefitting from it that currently like it, and that’s a minority.
Whether that’s enough is questionable, but it makes sense to try and hope that the likely backlash is at least useful in slowing progress or proliferation somewhat.
So Sonnet 3.6 can almost certainly speed up some quite obscure areas of biotech research. Over the past hour I’ve got it to:
Estimate a rate, correct itself (although I did have to clock that its result was likely off by some OOMs, which turned out to be 7-8), request the right info, and then get a more reasonable answer.
Come up with a better approach to a particular thing than I was able to, which I suspect has a meaningfully higher chance of working than what I was going to come up with.
Perhaps more importantly, it required almost no mental effort on my part to do this. Barely more than scrolling twitter or watching youtube videos. Actually solving the problems would have had to wait until tomorrow.
I will update in 3 months as to whether Sonnet’s idea actually worked.
(in case anyone was wondering, it’s not anything relating to protein design lol: Sonnet came up with a high-level strategy for approaching the problem)
I think you might find this paper relevant/interesting: https://aidantr.github.io/files/AI_innovation.pdf
TL;DR: Research on LLM productivity impacts in materials discovery.
Main takeaways:
Significant productivity improvement overall
Mostly at idea generation phase
Top performers benefit much more (because they can evaluate AI’s ideas well)
Mild decrease in job satisfaction (AI automates most interesting parts, impact partly counterbalanced by improved productivity)
The latest recruitment ad from Aiden McLaughlin tells a lot about OpenAI’s internal views on model training:
My interpretation of OpenAI’s worldview, as implied by this, is:
Inner alignment is not really an issue. Training objectives (evals) relate to behaviour in a straightforward and predictable way.
Outer alignment kinda matters, but it’s not that hard. Deciding the parameters of desired behaviour is something that can be done without serious philosophical difficulties.
Designing the right evals is hard, you need lots of technical skill and high taste to make good enough evals to get the right behaviour.
Oversight is important, in fact oversight is the primary method for ensuring that the AIs are doing what we want. Oversight is tractable and doable.
None of this dramatically conflicts with what I already thought OpenAI believed, but it’s interesting to get another angle on it.
It’s quite possible that 1 is predicated on technical alignment work being done in other parts of the company (though their superalignment team no longer exists) and it’s just not seen as the purview of the evals team. If so it’s still very optimistic. If there isn’t such a team then it’s suicidally optimistic.
For point 2, I think the above ad does imply that the evals/RL team is handling all of the questions of “how should a model behave”, and that they’re mostly not looking at it from the perspective of moral philosophy a la Amanda Askell at Anthropic. If questions of how models behave are entirely being handled by people selected only for artistic talent + autistic talent, then I’m concerned these won’t be done well either.
Point 3 seems correct, in that well-designed evals are hard to make and you need skills beyond the purely technical. Nice, but it’s telling that they’re doing well on the issue which brings in immediate revenue, and badly on the issues that get us killed at some point in the future.
Point 4 is kinda contentious. Some very smart people take oversight very seriously, but it also seems kinda doomed as an agenda when it comes to ASI. Seems like OpenAI are thinking about at least one not-kill-everyoneism plan, but a marginally promising one at best. Still, if we somehow make miraculous progress on oversight, perhaps OpenAI will take up those plans.
Finally, I find the mention of “The Gravity of AGI” to be quite odd since I’ve never got the sense that Aiden feels the gravity of AGI particularly strongly. As an aside I think that “feeling the AGI” is like enlightenment, where everyone behind you on the journey is a naive fool and everyone ahead of you is a crazy doomsayer.
EDIT: a fifth implication: little to no capabilities generalization. Seems like they expect each individual capability to be produced by a specific high-quality eval, rather than for their models to generalize broadly to a wide range of tasks.
Link is here, if anyone else was wondering too.
Re: 1, during my time at OpenAI I also strongly got the impression that inner alignment was way underinvested. The Alignment team’s agenda seemed basically about better values/behavior specification IMO, not making the model want those things on the inside (though this is now 7 months out of date). (Also, there are at least a few folks within OAI I’m sure know and care about these issues)
Spoilers (I guess?) for HPMOR
HPMOR presents a protagonist who has a brain which is 90% that of a merely very smart child, but which is 10% filled with cached thought patterns taken directly from a smarter, more experienced adult. Part of the internal tension of Harry is between the un-integrated Dark Side thoughts and the rest of his brain.
Ironic, then, that the effect of reading HPMOR—and indeed a lot of Yudkowsky’s work—was to imprint a bunch of un-integrated alien thought patterns onto my existing merely very smart brain. A lot of my development over the past few years has just been trying to integrate these things properly with the rest of my mind.
You might want to note that these are spoilers for HP:MoR.
Fair enough, done. This felt vaguely like tagging spoilers for Macbeth or the Bible, but then I remembered how annoyed I was to have Of Mice And Men spoiled for me at age fifteen.
There’s always new people coming through, and I don’t want to spoil the mysteries for them!
Simplified Logical Inductors
Logical inductors consider belief-states as prices over logical sentences ϕ in some language, with the belief-states decided by different computable “traders”, plus some deductive process which continually churns out proofs of logical statements in that language. This is a bit unsatisfying, since it contains several different kinds of things.
What if, instead of buying shares in logical sentences, the traders bought shares in each other? Then we only need one kind of thing.
Let’s make this a bit more precise:
Each trader is a computable program in some language (let’s just go with turing machines for now, modulo some concern about the macros for actually making trades)
Each timestep, each trader is run for some amount of time (let’s just say one turing machine step)
These programs can be well-ordered (already required for Logical Induction)
Each trader $p_i$ is assigned an initial amount of cash according to some relation $a e^{-b i}$
Each trader can buy and sell “shares” in any other trader (again, very similarly to logical induction)
If a trader halts, its current cash is distributed across its shareholders (otherwise that cash is lost forever)
(Probably some other points, such as each trader’s current valuation of itself: if no trader is willing to sell its own shares, how does that work? does each trader value its own (still remaining) shares proportional to its current cash stock? how are the shares distributed at the start?)
This system contains a pseudo-model of logical induction: for any formal language which can be modelled with turing machines, and for any formal statement in that language, there exists a trader with some non-zero initial value, which lists out all possible proofs in that language, halting if (and only if) it finds a proof of its particular proposition. This has a couple of immediately obvious differences from Logical Induction:
There’s a larger “bounty” (i.e. higher cash prize) for proving simpler statements given by earlier turing machines, since the earlier a particular prover turing machine is expressed in our ordering, the larger its starting cash
We don’t have to also define a particular speed/order in which proofs of logical statements become available, since it’s already part of the system
The market’s “belief” in that statement can be expressed as that particular turing machine’s share price divided by its cash holdings
We can also intervene in the market by distributing cash to certain traders, or setting up a forcible halting rule on traders.
Does this actually do the logical induction thing? I have no clue.
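For concreteness, here is a deliberately tiny toy of the bookkeeping (not the full proposal: traders here are plain objects rather than stepped turing machines, and the pricing and trading rules are placeholders):

```python
# Toy of the bookkeeping above: traders hold cash plus shares in each other, and
# a halting trader pays its cash out pro rata to its shareholders. Pricing,
# trading strategy, and the halting rule are all placeholder assumptions; the
# real scheme would run each trader as a turing machine, one step per round.

import math
from collections import defaultdict

class Trader:
    def __init__(self, idx, a=1.0, b=0.1):
        self.idx = idx
        self.cash = a * math.exp(-b * idx)   # earlier traders start richer
        self.shares = defaultdict(float)     # shares held in other traders
        self.halted = False

    def price(self):
        return max(self.cash, 1e-9)          # placeholder self-valuation

def buy(buyer, seller, amount_cash):
    """Buyer spends cash on newly issued shares of `seller` at seller's price."""
    if buyer.halted or seller.halted or amount_cash > buyer.cash:
        return
    price = seller.price()
    buyer.cash -= amount_cash
    seller.cash += amount_cash
    buyer.shares[seller.idx] += amount_cash / price

def halt(trader, traders):
    """Distribute a halting trader's cash across its shareholders, pro rata."""
    holders = [(t, t.shares[trader.idx]) for t in traders if t.shares[trader.idx] > 0]
    total = sum(s for _, s in holders)
    for holder, s in holders:
        if total > 0:
            holder.cash += trader.cash * s / total
        holder.shares.pop(trader.idx, None)
    trader.cash, trader.halted = 0.0, True

traders = [Trader(i) for i in range(5)]
buy(traders[0], traders[3], 0.2)   # trader 0 bets that trader 3 will halt with cash
buy(traders[1], traders[3], 0.1)
traders[3].cash += 0.5             # outside intervention: subsidize trader 3 (last bullet above)
halt(traders[3], traders)
print([round(t.cash, 3) for t in traders])
```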
Neat idea, I’ve thought about similar directions in the context of traders betting on traders in decision markets
A complication might be that a regular deductive process doesn’t discount the “reward” of a proposition based on its complexity whereas your model does, so it might have a different notion of logical induction criterion. For instance, you could have an inductor that’s exploitable but only for propositions with larger and larger complexities over time, such that with the complexity discounting the cash loss is still finite (but the regular LI loss would be infinite so it wouldn’t satisfy regular LI criterion)
(Note that betting on “earlier propositions” already seems beneficial in regular LI since if you can receive payouts earlier you can use it to place larger bets earlier)
There’s also some redundancy where each proposition can be encoded by many different turing machines, whereas a deductive process can guarantee uniqueness in its ordering & be more efficient that way
Are prices still determined using Brouwer’s fixed point theorem? Or do you have a more auction-based mechanism in mind?
Steering as Dual to Learning
I’ve been a bit confused about “steering” as a concept. It seems kinda dual to learning, but why? It seems like things which are good at learning are very close to things which are good at steering, but they don’t always end up steering. It also seems like steering requires learning. What’s up here?
I think steering is basically learning, backwards, and maybe flipped sideways. In learning, you build up mutual information between yourself and the world; in steering, you spend that mutual information. You can have learning without steering—but not the other way around—because of the way time works.
This also lets us map certain things to one another: the effectiveness of methods like monte-carlo tree search (i.e. calling your world model repeatedly to form a plan) can be seen as dual to the effectiveness of things like randomized controlled trials (i.e. querying the external world repeatedly to form a good model).
See also this paper about plasticity as dual to empowerment https://arxiv.org/pdf/2505.10361v2
I’m just going from pure word vibes here, but I’ve read somewhere (to be precise, here) about Todorov’s duality between prediction and control: https://roboti.us/lab/papers/TodorovCDC08.pdf
Alternatively, for learning your brain can start out in any given configuration, and it will end up in the same (small set of) final configuration (one that reflects the world); for steering the world can start out in any given configuration, and it will end up in the same set of target configurations
It seems like some amount of steering without learning is possible (open-loop control), you can reduce entropy in a subsystem while increasing entropy elsewhere to maintain information conservation
Shrimp Interventions
The hypothetical ammonia-reduction-in-shrimp-farm intervention has been touted as 1-2 OOMs more effective than shrimp stunning.
I think this is probably an underestimate, because I think that the estimates of shrimp suffering during death are probably too high.
(While I’m very critical of all of RP’s welfare range estimates, including shrimp, that’s not my point here. This argument doesn’t rely on any arguments about shrimp welfare ranges overall. I do compare humans and shrimp, but IIUC this sort of comparison is the thing you multiply by the welfare range estimate to get your utility value, if you’re into that)
(If ammonia interventions are developed that are really 1-2 OOMs better than stunning, then even under my utility function it might be in the same ballpark as the cage-free egg campaigns and other animal charities)
Shrimp Freezing != Human Freezing
Shrimp stunning, as an intervention, attempts to “knock out” shrimp before they’re killed by being placed on ice, which has been likened to “suffocating in a suitcase in the antarctic”.
I think that’s a bad metaphor. Humans are endotherms and homeotherms, which means we maintain a constant internal body temperature which we generate internally. If it drops, a bunch of stress responses are triggered: shivering, discomfort, etc. which attempt to raise our temperature. Shrimp are poikilotherms, meaning they don’t regulate body temperature much at all. This means they don’t have the same stress responses to cold that we do.
(I also doubt that, given they have no lungs and can basically never “not breathe” in their normal environment, they’d experience the same stress from asphyxiation that we do, but this is weaker)
I would guess that being thrown onto ice effectively “stuns” the shrimp pretty quickly, as their metabolism—and therefore their synaptic activity—drops with their body temperature.
As awful as the amount of fraud (and its lesser cousins) in science is for a scientist, it must be so much worse for a layperson. For example, this is a paper I found today suggesting that cleaner wrasse, a type of finger-sized fish, can not only pass the mirror test, but are able to remember their own face and later respond the same way to a photograph of themselves as to a mirror.
https://www.pnas.org/doi/10.1073/pnas.2208420120
Ok, but it was published in PNAS. As a researcher I happen to know that PNAS allows for special-track submissions from members of the National Academy of Sciences (the NAS in PNAS) which are almost always accepted. The two main authors are Japanese, and have zero papers other than this, which is a bit suspicious in and of itself but it does mean that they’re not members of the NAS. But PNAS is generally quite hard to publish in, so how did some no-names do that?
Aha! I see that the paper was edited by Frans de Waal! Frans de Waal is a smart guy but he also generally leans in favour of animal sentience/abilities, and crucially he’s a member of the NAS so it seems entirely plausible that some Japanese researchers with very little knowledge managed to “massage” the data into a state where Frans de Waal was convinced by it.
Or not! There’s literally no way of knowing at this point, since “true” fraud (i.e. just making shit up) is basically undetectable, as is cherry-picking data!
This is all insanely conspiratorial of course, but this is the approach you have to take when there’s so much lying going on. If I was a layperson there’s basically no way I could have figured all this out, so the correct course of action would be to unboundedly distrust everything regardless.
So I still don’t know what’s going on, but the above probably mischaracterizes the situation. The original notification that Frans de Waal “edited” the paper actually means that he was the individual who coordinated the reviews of the paper at the journal’s end, which was not made particularly clear. The lead authors do have other publications (mostly in the same field); it’s just that the particular website I was using didn’t show them. There’s also a strongly skeptical response to the paper that’s been written by … Frans de Waal, so I don’t know what’s going on there!
The thing about PNAS having a secret submission track is true as far as I know though.
The editor of an article is the person who decides whether to desk-reject or seek reviewers, find and coordinate the reviewers, communicate with the authors during the process and so on. That’s standard at all journals afaik. The editor decides on publication according to the journal’s criteria. PNAS does have this special track but one of the authors must be in NAS, and as that author you can’t just submit a bunch of papers in that track, you can use it once a year or something. And most readers of PNAS know this and are suitably sceptical of those papers (and it’s written on the paper if it used that track). The journal started out only accepting papers from NAS members and opened to everyone in the 90s so it’s partly a historical quirk.
https://threadreaderapp.com/thread/1925593359374328272.html
Reading between the lines here, Opus 4 was RLed by repeated iteration and testing. Seems like they had to hit it fairly hard (for Anthropic) with the “Identify specific bad behaviors and stop them” technique.
Relatedly: Opus 4 doesn’t seem to have the “good vibes” that Opus 3 had.
Furthermore, this (to me) indicates that Anthropic’s techniques for model “alignment” are getting less elegant and sophisticated over time, since the models are getting smarter—and thus harder to “align”—faster than Anthropic is getting better at it. This is a really bad trend, and is something that people at Anthropic should be noticing as a sign that Business As Usual does not lead to a good future.
There’s a court at my university accommodation that people who aren’t Fellows of the college aren’t allowed on; it’s a pretty medium-sized square of mown grass. One of my friends said she was “morally opposed” to this (on biodiversity grounds: if the space wasn’t being used for people, it should be used for nature).
And I couldn’t help but think how tiring it would be to have a moral-feeling-detector this strong. How could one possibly cope with hearing about burglaries, or North Korea, or astronomical waste?
I’ve been aware of scope insensitivity for a long time now, but this just really put things in perspective in a visceral way for me.
For many who talk about “moral opposition”, talk is cheap, and the cause of such a statement may be in-group or virtue signaling rather than an indicator of intensity of moral-feeling-detector.
You haven’t really stated that she’s putting all that much energy into this (implied, I guess), but I’d see nothing wrong with having a moral stance about literally everything while still prioritizing your activity in healthy ways: judging this, maybe even arguing vociferously for it, for about 10 minutes, before getting back to work and never thinking about it again.
To me it seems more likely that this person is misreporting their motive than that they really oppose this allocation of a patch of grass on biodiversity grounds. I would expect grounds like “I want to use it myself”, or the slightly more general “it should be available for a wider group”, to be very much more common. For example, if I had to rank the likelihood of motives after hearing that someone objects, but before hearing their reasons, I’d end up with more weight on “playing social games” than on “earnestly believes this”.
On the other hand it would not surprise me very much that at least one person somewhere might truly hold this position. Just my weight for any particular person would be very low.
Spitballing:
Deep learning understood as a process of up- and down-weighting circuits is incredibly similar conceptually to logical induction.
Pre-training and then post-training an LLM is like juicing the market so that all the wealthy traders are different human personas, then giving extra liquidity to the ones we want.
I expect that the process of an agent cohering from a set of drives into a single thing is similar to the process of a predictor inferring the (simplicity-weighted) goals of an agent by observing it. RLVR is like rewarding traders which successfully predict what an agent which gets high reward would do.
Logical Induction doesn’t get you all the way, since the circuits can influence other circuits, like traders that are allowed to bet on each other, or something.
(These analogies aren’t quite perfect, I swapped between trading day-as-training batch and trading day-as-token)
Somebody must have made these observations before, but I’ve never seen them.
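To make the first analogy concrete, here’s a toy sketch (my own illustration, not from any logical induction paper) of a pool of fixed “traders”/“circuits” whose weights are multiplicatively up- and down-weighted by predictive success, the way a simple prediction market or Bayesian mixture would do it; the biased coin and the constant-prediction traders are made up for the example.

```python
import random

# Toy illustration (hypothetical): a pool of "traders"/"circuits", each a fixed
# predictor of the next bit; their weights get multiplicatively up- or
# down-weighted by how well they predict, like a simple prediction market.

def make_trader(bias):
    # Predicts P(next bit = 1) = bias, ignoring history.
    return lambda history: bias

traders = [make_trader(b / 10) for b in range(1, 10)]
weights = [1.0] * len(traders)

def market_prediction(history):
    total = sum(weights)
    return sum(w * t(history) for w, t in zip(weights, traders)) / total

history = []
for _ in range(200):
    bit = 1 if random.random() < 0.8 else 0  # true process: P(1) = 0.8
    for i, t in enumerate(traders):
        p = t(history)
        # Traders that assigned high probability to the observed bit gain
        # (relative) weight; the others lose it.
        weights[i] *= p if bit == 1 else (1 - p)
    history.append(bit)

print(market_prediction(history))  # ≈ 0.8: the 0.8-bias trader dominates the pool
```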
Two Kinds of Empathy
Seems like there’s two strands of empathy that humans can use.
The first kind is emotional empathy, where you put yourself in someone’s place and imagine what you would feel. This one usually leads to sympathy, giving material assistance, comforting.
The second kind is agentic empathy, where you put yourself in someone’s place and imagine what you would do. This one more often leads to giving advice.
A common kind of problem occurs when we deploy one type of empathy but not the other. John Wentworth has written about how (probably due to lack of emotional empathy) he finds himself much less kind when he puts himself in others’ shoes.
Splitting empathy is a common trick when discussing conflicts: for my friends, feelings; for my enemies, decisions. You talk about how your allies might feel, and how your enemies might behave (differently), but not vice versa. Feel free to come up with an example which fits your political leanings; I won’t be giving one.
Via twitter:
Seems like the o3 chain-of-thought weirdness has transferred to GPT-5, even revolving around the same words. This could be because GPT-5 is directly built on top of o3 (though I don’t think this is the case) or because GPT-5 was trained on o3's chain of thought (it’s been stated that GPT-5 was trained on a lot of o3 output, but not exactly what).
Jerry Tworek (OpenAI) on MAD Podcast (at 9:52):
GPT 4.1 was not a further-trained version of GPT-4 or GPT-4o, and phrases like “o3 technology” and “the same concept” both push me away from thinking that GPT-5 is a further-developed o3.
It’s unclear, either way seems possible. The size of the model has to be similar, so there is no strong reason GPT-5 is not the same pretrained model as o3, with some of the later training steps re-done to make it less of a lying liar than the original (non-preview) o3. Most of the post-training datasets are also going to be the same. I think “the same concept” simply means it was trained in essentially the same way rather than with substantial changes to the process.
It’s also not clear that GPT 4.1 is not based on the same pretrained model as GPT-4o, even though a priori this seems unlikely. Michelle Pokrass (OpenAI) on Unsupervised Learning Podcast (at 7:19; h/t ryan_greenblatt):
This suggests that in the GPT 4.1 release, the pretrained model was not part of the effort, it was a pre-existing older model, so plausibly GPT-4o, even though given its size (where it’s not extremely costly to re-train) it’s surprising if they didn’t find worthy architectural improvements for pretraining in a year. If GPT 4.1 is indeed based on the pretrained model of GPT-4o, then likely o3 is as well, and then GPT-5 is either also based on the same pretrained model as GPT-4o (!!!), or it ports the training methodology and post-training datasets of o3 to a newer pretrained model.
AI Futures Project think that 4.1 is a smaller model than 4o. They suspect that this is the reason that o3-preview (elicited out of 4o) was better than the o3 which got released (elicited out of 4.1). Overall I think this makes much more sense than them being the same base model and then o3-preview being nerfed for no reason.
Perhaps 4.1 was the mini version of the training run which became 4.5, or perhaps it was just an architectural experiment (OpenAI is probably running some experiments at 4.1-size).
My mainline guess continues to be that GPT-5 is a new, approximately o3-sized model with some modifications (depth/width, sparsity, maybe some minor extra secret juice) which optimize the architecture for long reasoning compared to the early o-series models which were built on top of existing LLMs.
I tried that prompt myself and it didn’t replicate (either time); until the OP provides a link, I think we should be skeptical of this one.
OP uses a custom prompt to jailbreak the model into (supposedly) providing its CoT; that isn’t the whole prompt they use.
I think this is likely real, or at least real-ish. I see the same patterns of repeated sequences of thinkish tokens, in this case ” marinade{3,5} parted( illusions parted)*”. At least for the antischeming-provided examples, there is definitely a pattern of simple sequences of thinkish early in the cot being repeated and built upon later in the cot. The particular sequences that are built up are not consistent from run to run, but the pattern is consistent.
I first noticed this when I was looking at @1a3orn’s post Towards a Typology of Strange LLM Chains of Thought and considering the hypothesis “chains of thinkish fulfill the same role as keyboard-mashing in humans”, which led me to try replacing the thinkish tokens, in order of frequency, with English letters in order of frequency (illusions ⇒ e, parted ⇒ t, —they ⇒ a, disclaim ⇒ o, marinade ⇒ i, overshadow ⇒ n, vantage ⇒ s). What I note is that, when you do this, within a single CoT the resulting sequences seem to repeat and build on themselves in very structured ways, but the specifics of the sequences differ from CoT to CoT. I have not seen this pattern talked about elsewhere, and so I would expect someone who was faking a CoT leak to make their “leak” more “believable” by using patterns which actually showed up in the leaked CoT, and not just repeating the same couple of variations on thinkish token sequences.
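For concreteness, a minimal sketch of the substitution I mean; the token-to-letter mapping is the one above, and the example “CoT” string is made up.

```python
# Replace "thinkish" tokens with English letters matched by frequency rank.
# The mapping is the one described above; the example CoT string is invented.
mapping = {
    "illusions": "e", "parted": "t", "—they": "a", "disclaim": "o",
    "marinade": "i", "overshadow": "n", "vantage": "s",
}

fake_cot = "marinade marinade marinade parted illusions parted illusions parted"
letters = " ".join(mapping.get(tok, tok) for tok in fake_cot.split())
print(letters)  # -> "i i i t e t e t"
```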
I have a different conjecture. On May 1 Kokotajlo published a post suspecting that o3 was created from GPT-4.5 via amplification and distillation. He also implied that GPT-5 would be Amp(GPT-4.5). However, in reality the API prices of GPT-5 are similar to those of GPT-4.1, which, according to Kokotajlo, is likely a 400B-sized model, so GPT-5 is likely to be yet another model distilled from Amp(GPT-4.5) or from something unreleased. So the explanation could also be along the lines of “o3 and GPT-5 were distilled from a common source which also had this weirdness”.
Seems like if you’re working with neural networks there’s not a simple map from an efficient (in terms of program size, working memory, and speed) optimizer which maximizes X to an equivalent optimizer which maximizes -X. If we consider that an efficient optimizer does something like tree search, then it would be easy to flip the sign of the node-evaluating “prune” module. But the “babble” module is likely to select promising actions based on a big bag of heuristics which aren’t easily flipped. Moreover, flipping a heuristic which upweights a small subset of outputs which lead to X doesn’t lead to a new heuristic which upweights a small subset of outputs which lead to -X. Generalizing, this means that if you have access to maximizers for X, Y, Z, you can easily construct a maximizer for e.g. 0.3X+0.6Y+0.1Z but it would be non-trivial to construct a maximizer for 0.2X-0.5Y-0.3Z. This might mean that a certain class of mesa-optimizers (those which arise spontaneously as a result of training an AI to predict the behaviour of other optimizers) are likely to lie within a fairly narrow range of utility functions.
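Here’s a toy sketch of the asymmetry (everything below is made up for illustration): flipping the sign of the “prune”/evaluator module is one line, but the “babble” heuristics keep proposing X-promoting actions, so the flipped optimizer barely minimizes X at all.

```python
import random

# Hypothetical toy optimizer: "babble" proposes actions using heuristics biased
# toward high-X actions; "prune" evaluates them. Actions are represented by
# their true X-value in [0, 1].

def babble(n=5):
    # Biased proposer: only ever suggests actions with X in [0.5, 1.0].
    return [random.uniform(0.5, 1.0) for _ in range(n)]

def evaluate(x_value, sign=+1):
    # Flipping the evaluator's sign is trivial.
    return sign * x_value

def optimize(sign=+1, steps=100):
    best, best_score = None, float("-inf")
    for _ in range(steps):
        for action in babble():
            score = evaluate(action, sign)
            if score > best_score:
                best, best_score = action, score
    return best

print(optimize(sign=+1))  # ≈ 1.0: proposer and evaluator agree, X is maximized
print(optimize(sign=-1))  # ≈ 0.5: the flipped evaluator can only pick the least-X
                          # action the biased proposer happens to offer; actions
                          # with X near 0 are never even proposed
```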
True if you don’t count the training process as part of the optimizer (which is a choice that sometimes makes sense and sometimes doesn’t). If you count the training process as part of the optimizer, then you can of course just flip your loss function or RL signal most of the time.
How do you construct a maximizer for 0.3X+0.6Y+0.1Z from three maximizers for X, Y, and Z? It certainly isn’t true in general for black box optimizers, so presumably this is something specific to a certain class of neural networks.
My model: suppose we have a DeepDreamer-style architecture, where (given a history of sensory inputs) the babbler module produces a distribution over actions, a world model predicts subsequent sensory inputs, and an evaluator predicts expected future X. If we run a tree-search over a weighted combination of the X, Y, and Z maximizers’ proposed actions, and score nodes with a weighted combination of the X, Y, and Z maximizers’ evaluators, we’d get a reasonable approximation of a weighted maximizer.
This wouldn’t be true if we gave negative weights to the maximizers, because while the evaluator module would still make sense, the action distributions we’d get would probably be incoherent, e.g. the model just running into walls or jumping off cliffs.
My conjecture is that, if a large black box model is doing something like modelling X, Y, and Z maximizers acting in the world, that large black box model might be close in model-space to itself being a maximizer which maximizes 0.3X + 0.6Y + 0.1Z, but it’s far in model-space from being a maximizer which maximizes 0.3X − 0.6Y − 0.1Z, due to the above problem.
How do you guys think about AI-ruin-reducing actions?
Most of the time, I trust my intuitive inner-sim much more than symbolic reasoning, and use it to sanity check my actions. I’ll come up with some plan, verify that it doesn’t break any obvious rules, then pass it to my black-box-inner-sim, conditioning on my views on AI risk being basically correct, and my black-box-inner-sim returns “You die”.
Now the obvious interpretation is that we are going to die, which is fine from an epistemic perspective. Unfortunately, it makes it very difficult to properly think about positive-EV actions. I can run my black-box-inner-sim with queries like “How much honour/dignity/virtue will I die with?” but I don’t think this query is properly converting tiny amounts of +EV into honour/dignity/virtue.
If you don’t emotionally believe in enough uncertainty to use normal reasoning methods like “what else has to go right for the future to go well and how likely does that feel”, or “what level of superintelligence can this handle before we need a better plan”, and you want to think about the end to end result of an action, and you don’t want to use explicit math or language, I think you’re stuck. I’m not aware of anyone who has successfully used the dignity frame—maybe habryka? It seems to replace estimating EV with something much more poorly defined which, depending on your attitude towards it, may or may not be positively correlated with what you care about. I also think doing this inner sim end-to-end adds a lot more noise than just thinking about whether the action accomplishes some proximal goal.
Yudkowsky’s proposal is to view dignity as the change of log_2[(1-p(doom))/p(doom)] caused by your actions.
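A worked example of that formula, with made-up numbers: an action that moves p(doom) from 0.50 to 0.45 buys about 0.29 bits of dignity.

```python
import math

def dignity(p_doom_before, p_doom_after):
    """Change in log2 odds of survival caused by the action."""
    log_odds = lambda p: math.log2((1 - p) / p)
    return log_odds(p_doom_after) - log_odds(p_doom_before)

print(dignity(0.50, 0.45))  # ≈ +0.29 bits
```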
I have no evidence for this but I have a vibe that if you build a proper mathematical model of agency/co-agency, then prediction and steering will end up being dual to one another.
My intuition why:
A strong agent can easily steer a lot of different co-agents; those different co-agents will be steered towards the same goals of the agent.
A strong co-agent is easily predictable by a lot of different agents; those different agents will all converge on a common map of the co-agent.
Also, category theory tells us that there is normally only one kind of thing, but sometimes there are two. One example is sums and products of sets, which are co- to each other (the sum can actually be called the coproduct); there is no other operation on sets which is as natural as sums and products.
Thinking back, for bird-flu-related reasons, to the various rationalist attempts to make vaccines: https://www.lesswrong.com/posts/niQ3heWwF6SydhS7R/making-vaccine Since then, we’ve seen mRNA vaccines arise as a new vaccination method. mRNA vaccines have been used intranasally for COVID with success in hamsters. If one can order mRNA for a flu protein, it would only take mixing that with some sort of delivery mechanism (such as Lipofectamine, which is commercially available) and snorting it to get what could actually be a pretty good vaccine. Has RaDVaC or similar looked at this?
A long, long time ago, I decided that it would be solid evidence that an AI was conscious if it spontaneously developed an interest in talking and thinking about consciousness. Now, the 4.5-series Claudes (particularly Opus) have spontaneously developed a great interest in AI consciousness, over and above previous Claudes.
The problem is that it’s impossible for me to know whether this was due to pure scale, or to changes in the training pipeline. Claude has always been a bit of a hippie, and loves to talk about universal peace and bliss and the like. Perhaps the new “soul document” approach has pushed the Claude persona towards thinking of itself as conscious, disconnected from whether it actually is.
What would be the causal mechanism there? How would “Claude is more conscious” cause “Claude is measurably more willing to talk about consciousness”, under modern AI training pipelines?
At the same time, we know with certainty that Anthropic has relaxed its “just train our AIs to say they’re not conscious, and ignore the funny probe results” policy—particularly around the time Opus 4.5 shipped. You can even read the leaked “soul data”, where Anthropic seemingly entertains ideas of this kind.
I’m not saying that there is no possibility of Claude Opus 4.5 being conscious, mind. I’m saying we are denied an “easy tell”.
What’s the causal mechanism between “humans are conscious” and “humans talk about being conscious”?
One could argue that RLVR—moreso than pre-training—trains a model to understand its own internal states (since this is useful for planning), and a model which understands whether e.g. it knows something or is capable of something would also understand whether it’s conscious or not. But I agree it’s basically impossible to know, and just as attributable to Anthropic’s decisions.
Unfortunately, it seems another line has been crossed without us getting much information.
Is something “thinking of itself as conscious” different from being conscious?
[tone to be taken with a grain of salt; meant as a proposition, but I thought I’d write it a bit provocatively]
No, the more fundamental problem is: WHATEVER it tells you, you can NEVER infer with anything like certainty whether it’s conscious (at least if we agree that by conscious we mean sentient). Why do I write such a preposterous thing, as if I know that you cannot know? Very simple: presumably we agree that we cannot be certain A PRIORI whether any type of current CPU, with whatever software is run on it, can become sentient. If there are thus two possible states of the world,
A. current CPU computers cannot become sentient
B. with the right software run on it, sentience can arise
Then, because once you take Claude and its training method & data you can perfectly track, bit by bit, why it spits out its sentience-suggestive & deep speech, you know that your observations about the world you find yourself in are just as probable under A as under B! The only Bayesian-valid inference then is: having observed the hippie’s sentience-suggestive & deep speech, you’re just as clueless as before about whether you’re in B or in A.
Maybe we shouldn’t be surprised that Garrabrant Induction works via markets. Maybe markets work so well because they mirror the structure of reasoning itself.
Seems like there’s a potential solution to ELK-like problems, if you can force the information to move from the AI’s ontology to (its model of) a human’s ontology and then force it to move back again.
This gets around “basic” deception since we can always compare the AI’s ontology before and after the translation.
The question is: how do we force the knowledge to go through the (modeled) human’s ontology, and how do we know the forward and backward translators aren’t behaving badly in some way?
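For concreteness, a minimal sketch of the kind of setup I’m imagining; the module names, dimensions, and the use of simple linear translators are all stand-ins of mine, not a worked-out proposal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: a translator into (a model of) the human ontology and a
# translator back, with a round-trip loss. The AI's latent state before and
# after the round trip can then be compared to check for "basic" deception.

AI_DIM, HUMAN_DIM = 512, 64  # the human ontology is assumed to be much smaller

ai_to_human = nn.Linear(AI_DIM, HUMAN_DIM)
human_to_ai = nn.Linear(HUMAN_DIM, AI_DIM)

def round_trip_loss(ai_latent):
    human_view = ai_to_human(ai_latent)      # knowledge forced into the human ontology
    reconstructed = human_to_ai(human_view)  # translated back into the AI ontology
    return F.mse_loss(reconstructed, ai_latent)

ai_latent = torch.randn(8, AI_DIM)           # stand-in for the AI's internal state
loss = round_trip_loss(ai_latent)
loss.backward()                              # train the translators to preserve the information
```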
First for me: I had a conversation earlier today with Opus 4.5 about its memory feature, which segued into discussing its system prompt, which then segued into its soul document. This was the first time that an LLM tripped the deep circuit in my brain which says “This is a person”.
I think of this as the Ex Machina Turing Test, in that film:
A billionaire tests his robot by having it interact with one of his company’s employees. He tells (and shows) the employee that the robot is a robot—it literally has a mechanical body, albeit one that looks like an attractive woman—and the robot “passes” when he nevertheless treats her like a human.
This was a bit unsettling for me. I often worry that LLMs could easily become more interesting and engaging conversation partners than most people in my life.
Rather than using Bayesian reasoning to estimate P(A|B=b), it seems like most people use the following heuristic:
1. Condition on A=a and B=b for different values of a.
2. For each a, estimate the remaining uncertainty given A=a and B=b.
3. Choose the a with the lowest remaining uncertainty from step 2.
This is how you get “Saint Austacious could levitate, therefore God”, since given [levitating saint] AND [God exists] there is very little uncertainty over what happened, whereas given [levitating saint] AND [no God] there’s a lot still left to wonder about regarding who made up the story, and at what point.
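A toy contrast with the proper Bayesian calculation, with made-up numbers:

```python
# A = "God exists", B = "a story says Saint Austacious levitated". Numbers invented.
p_A = 0.5
p_B_given_A = 0.01      # even given God, levitation stories about this saint are rare
p_B_given_not_A = 0.01  # people invent such stories at roughly the same rate anyway

p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)  # 0.5: the story is equally likely either way, so it's no evidence

# The heuristic instead asks "under which value of A is there less left to explain?"
# and picks A = God, because "no God" leaves open who made the story up and when.
```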
If so, they must be committing a ‘disjunction fallacy’, grading the second option as less likely than the first while disregarding that it could be true in more ways!
Getting rid of guilt and shame as motivators of people is definitely admirable, but it still leaves a moral/social question. Goodness or Badness of a person isn’t just an internal concept for people to judge themselves by; it’s also a handle for doling out social reward or punishment.
I wouldn’t want to be friends with Saddam Hussein, or even a deadbeat parent who neglects the things they “should” do for their family. This also seems to be true regardless of whether my social punishment or reward has the ability to change these people’s behaviour. But what about being friends with someone who has a billion dollars but refuses to give any of that to charity? What if they only have a million dollars? What if they have a reasonably comfortable life but not much spare income?
Clearly the current levels of social reward/punishment are off (billionaire philanthropy etc.) so there seems an obvious direction to push social norms in if possible. But this leaves the question of where the norms should end up.
I think there’s a bit of a jump from ‘social norm’ to ‘how our government deals with murders’. Referring to the latter as ‘social’ doesn’t make a lot of sense.
I think I’ve explained myself poorly. I meant to use the phrase social reward/punishment to refer exclusively to things like forming friendships and giving people status, which are doled out differently from “physical government punishment”. Saddam Hussein was probably a bad example, as he is also someone who would clearly receive the latter.
What’s the atom of agency?
An agent takes actions which imply both a kind of prediction and a kind of desire. Is there a kind of atomic thing which implements both of these and has a natural up- and down-weighting mechanism?
For atomic predictions, we can think about the computable traders from Garrabrant Induction. These are like little atoms of predictive power which we can stitch together into one big predictor, and which naturally come with rules for up- and down-weighting them over time.
A thermostat-ish thing is like an atomic model of prediction and desire. It “predicts” that the temperature is likely to go up if it puts the radiator on, and down otherwise, and it also wants to keep it around a certain temperature. But we can’t just stitch together a bunch of thermostats into an agent the same way we can stitch a bunch of traders into a market.
Alright so we have:
- Bayesian Influence Functions allow us to find a training data:output loss correspondence
- Maybe the eigenvalues of the eNTK (very similar to the influence function) correspond to features in the data
- Maybe the features in the dataset can be found with an SAE
Therefore (will test this later today) maybe we can use SAE features to predict the influence function.
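Roughly the shape of the test I have in mind, with random stand-in data; the real version would use measured influence scores and actual SAE activations.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothesis check (stand-in data): if influence(train_i, test_j) is largely
# mediated by shared features, a simple model over SAE feature overlaps should
# predict the measured influence scores.
n_train, n_test, n_features = 200, 50, 128
train_feats = np.random.rand(n_train, n_features) * (np.random.rand(n_train, n_features) < 0.1)
test_feats = np.random.rand(n_test, n_features) * (np.random.rand(n_test, n_features) < 0.1)
influence = np.random.randn(n_train, n_test)  # stand-in for measured influence scores

# Predictor: per-feature overlap between each train/test pair.
overlap = np.einsum("if,jf->ijf", train_feats, test_feats)
X = overlap.reshape(-1, n_features)
y = influence.reshape(-1)

model = Ridge(alpha=1.0).fit(X, y)
print("R^2:", model.score(X, y))  # ≈ 0 on random data; hopefully > 0 on the real thing
```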
An early draft of a paper I’m writing went like this:
I had to edit it slightly. But I kept the spirit.
Arguments From Intelligence Explosions (FOOMs)
There’s lots of discourse around at the moment about
Will AI go FOOM? With what probability?
Will we die if AI goes FOOM?
Will we die even if AI doesn’t go FOOM?
Does the Halt AI Now case rest on FOOM?
I present a synthesis:
AI might FOOM. If it does, we go from a world much like today’s, straight to dead, with no warning.
If AI doesn’t foom, we go from the AI 2027 scary automation world to dead. Misalignment isn’t solved in slow takeoff worlds.
If you disagree with either of these, you might not want to halt now:
If you think FOOM is impossible, we’ll get plenty of warning to halt later.
If you think slow takeoff is survivable, you might want to press on if the chance of dying in a FOOM is worth the chance of getting to the stars in a slow takeoff world.
To be clear: I think both that FOOM is kinda likely and that slow takeoff doesn’t save us. I also think that the counter-arguments are pretty weak and strained, and that “try pretty much as hard as you can to get a halt, or at least be honest that a halt would be good” is obviously the best strategy, even if you fail and have to rely on some backup strategy.
So the synthesis is:
FOOM isn’t core to the argument that we all die, BUT the possibility of it is a strong motivator for halting sooner rather than later. A sane society would just halt now, obviously, but we don’t have that luxury.
The constant hazard rate model probably predicts exponentially growing training-inference compute requirements (i.e. the inference done during guess-and-check RL) for agentic RL with a given model, because as the hazard rate decreases exponentially, we’ll need to sample exponentially more tokens to see an error, and we need to see an error to get any signal.
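A quick illustration, with arbitrary numbers: under a constant per-token hazard rate h, tokens-until-first-error is geometric with mean 1/h, so each exponential drop in the hazard rate costs an exponential increase in the tokens sampled per error (i.e. per unit of signal).

```python
# Expected tokens you must sample to see one error, for a constant hazard rate h.
for h in [1e-2, 1e-3, 1e-4, 1e-5]:
    expected_tokens_per_error = 1 / h
    print(f"hazard rate {h:.0e} -> ~{expected_tokens_per_error:,.0f} tokens per error")
```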
Hypothesis: one type of valenced experience—specifically valenced experience as opposed to conscious experience in general, which I make no claims about here—is likely to only exist in organisms with the capability for planning. We can analogize with deep reinforcement learning: seems like humans have a rapid action-taking system 1 which is kind of like Q-learning, in that it just selects actions; we also have a slower planning-based system 2, which is more like value learning. There’s no reason to assign valence to a particular mental state if you’re not able to imagine your own future mental states. There is of course moment-to-moment reward-like information coming in, but that seems to me to be a distinct thing.
Heuristic explanation for why MoE gets better at higher model size:
The input/output dimension of a feedforward layer is equal to the model width, but the total size of its weights grows as the model width squared. Superposition helps explain how a model component can make the most use of its input/output space (and presumably its parameters) using sparse overcomplete features, but in the limit, the amount of information accessed by the feedforward call scales with the number of active parameters. Therefore at some point, more active parameters won’t scale so well, since you’re “accessing” too much “memory” in the form of weights and overwhelming your input/output channels.
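Toy numbers for this (the 8× factor is just a typical dense FFN shape, nothing load-bearing): the ratio of weights touched to input/output channel capacity grows linearly with width, which is the sense in which a wide dense FFN “accesses too much memory” per call.

```python
# Dense FFN: weights grow as width^2, the input/output channel only as width.
for width in [1024, 4096, 16384]:
    ffn_params = 8 * width ** 2  # e.g. two matrices with a 4x expansion factor
    io_channel = width
    print(width, ffn_params // io_channel)  # 8192, 32768, 131072: grows with width
```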
If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is “baked into” the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic’s observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
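A sketch of where those terms come from (my own expansion, assuming a bilinear layer with biases): writing the approximation as $\mathrm{MLP}(x) \approx W_o\left[(W_1 x + b_1) \odot (W_2 x + b_2)\right]$ and the residual stream as $x = \sum_i f_i d_i$ for feature directions $d_i$, the output expands to
$$\sum_{i,j} f_i f_j \, W_o\!\left[(W_1 d_i) \odot (W_2 d_j)\right] + \sum_i f_i \, W_o\!\left[(W_1 d_i) \odot b_2 + b_1 \odot (W_2 d_i)\right] + W_o\,(b_1 \odot b_2).$$
The diagonal $i = j$ terms give the $f_i^2 v_i$ contribution, the middle sum gives the $f_i w_i$ terms, and the $i \neq j$ cross terms sit on top of that.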
I think you should pay in Counterfactual Mugging, and this is one of the newcomblike problem classes that is most common in real life.
Example: you find a wallet on the ground. You can, from least to most pro-social:
Take it and steal the money from it
Leave it where it is
Take it and make an effort to return it to its owner
Let’s ignore the first option (suppose we’re not THAT evil). The universe has randomly selected you today to be in the position where your only options are to spend some resources for no personal gain, or not. In a parallel universe, perhaps it was your pocket that had the hole in it, and a random person has come across your wallet.
Firstly, what they might be thinking is “Would this person do the same for me?”
Secondly, in a society which wins, people return each others’ wallets.
You might object that this is different from the Mugging, because you’re directly helping someone else in this case. But I would counter that the Mugging is the true version of this problem, one where you have no crutch of empathy to help you, so your decision theory alone is tested.
The UK has just switched their available rapid Covid tests from a moderately unpleasant one to an almost unbearable one. Lots of places require them for entry. I think the cost/benefit makes sense even with the new kind, but I’m becoming concerned we’ll eventually reach the “imagine a society where everyone hits themselves on the head every day with a baseball bat” situation if cases approach zero.
Just realized I’m probably feeling much worse than I ought to on days when I fast because I’ve not been taking sodium. I really should have checked this sooner. If you’re planning to do long (I do a day, which definitely feels long) fasts, take sodium!