Are you American? Because as a British person I would say that the first version looks a lot better to me, and certainly fits the standards for British non-fiction books better.
Though I do agree that the subtitle isn’t quite optimal.
Might be to avoid people stealing the unembedding matrix weights.
Out of domain (i.e. on a different math benchmark), the RLed model does better at pass@256, especially when using algorithms like RLOO and Reinforce++. If there is a crossover point, it is in the thousands (Figure 7).
This seems critically important. Production models are RLed on hundreds to thousands of benchmarks.
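(For reference, since the pass@N comparison is what the argument hinges on: pass@k is normally estimated from n ≥ k samples with c observed successes via the unbiased estimator from the Codex/HumanEval paper. A minimal sketch of that estimator, not necessarily how this particular paper computed it:)

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c passed
    (Chen et al. 2021): 1 - C(n-c, k) / C(n, k), computed stably."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 1024 samples, 3 correct: pass@1 ≈ 0.003, but pass@256 ≈ 0.58
print(pass_at_k(1024, 3, 1), pass_at_k(1024, 3, 256))
```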
We should also consider that, well, this result just doesn’t pass the sniff test given what we’ve seen RL models do. o3 is a lot better than o1 in a way which suggests that RL budgets do scale heavily with compute, and o3 if anything is better at scaling up in a pass@N way (o3 is reported to be fully parallelizable, capable of scaling up to $1000s’ worth of compute).
Something is up here. Maybe it’s a lack of test-time scaling; maybe OpenAI really do have a secret RL algorithm (nobody else has demonstrated the capability to scale up test-time compute in quite the way that o3 can). Maybe the authors just did it wrong. Maybe the authors didn’t do enough RL (again, we know o3 used a lot of RL compute; the authors here only did 100s of steps).
Overall I don’t buy the conclusions of that paper.
OK so some further thoughts on this: suppose we instead just partition the values of directly by something like a clustering algorithm, based on in space, and take just be the cluster that is in:
Assuming we can do it with small clusters, we know that is pretty small, so is also small.
And if we consider , this tells us that learning restricts us to a pretty small region of space (since ) so should be approximately deterministic in . This second part is more difficult to formalize, though.
Edit: The real issue is whether or not we could have lots of values which produce the same distribution over but different distributions over , and all be pretty likely given for some . I think this just can’t really happen for probable values of , because if these values of produce the same distribution over , but different distributions over , then that doesn’t satisfy , and secondly because if they produced wildly different distributions over , then that means they can’t all have high values of , and so they’re not gonna have high values of .
Too Early does not preclude Too Late
Thoughts on efforts to shift public (or elite, or political) opinion on AI doom.
Currently, it seems like we’re in a state of being Too Early. AI is not yet scary enough to overcome people’s biases against AI doom being real. The arguments are too abstract and the conclusions too unpleasant.
Currently, it seems like we’re in a state of being Too Late. The incumbent players are already massively powerful and capable of driving opinion through power, politics, and money. Their products are already too useful and ubiquitous to be hated.
Unfortunately, these can both be true at the same time! This means that there will be no “good” time to play our cards. Superintelligence (2014) was Too Early but not Too Late. There may be opportunities which are Too Late but not Too Early, but (tautologically) these have not yet arrived. As it is, current efforts must fight on both fronts.
Under this formulation, FEP is very similar to RL-as-inference. But RL-as-inference is a generalization of a huge number of RL algorithms, from Q-learning to LLM fine-tuning. This does kind of make sense if we think of FEP as just a different way of looking at things, but it doesn’t really help us narrow down the algorithms that the brain is actually using. Perhaps that’s actually all FEP is trying to do, though, and Friston has IIRC said things to that effect—that FEP is just a reframing/generalization and not an actual model of the underlying algorithms being employed.
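(For reference, the RL-as-inference framing I have in mind here is the standard control-as-inference setup, à la Levine’s 2018 tutorial, nothing FEP-specific: introduce optimality variables $\mathcal{O}_t$ with $p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp(r(s_t, a_t))$, pick a variational posterior $q(\tau) = p(s_1)\prod_t p(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t)$, and the ELBO becomes maximum-entropy RL:)

$$\log p(\mathcal{O}_{1:T} = 1) \;\ge\; \mathbb{E}_{q(\tau)}\Big[\textstyle\sum_t r(s_t, a_t)\Big] + \sum_t \mathbb{E}_{q}\big[\mathcal{H}\big(q(a_t \mid s_t)\big)\big] + \text{const},$$

assuming a uniform action prior; with a non-uniform prior (e.g. a base LLM policy) the entropy term becomes a KL penalty to that prior, which is exactly the KL-regularized objective used in LLM fine-tuning, and soft Q-learning is one way of optimizing the same objective.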
This seems not to be true assuming a P(doom) of 25% and a purely selfish perspective, or even a moderately altruistic perspective which places most of its weight on, say, the person’s immediate family and friends.
Of course any cryonics-free strategy is probably dominated by that same strategy plus cryonics for a personal bet at immortality, but when it comes to friends and family it’s not easy to convince people to sign up for cryonics! And immortality-maxxing for one’s friends and family almost certainly entails accelerating AI, even at a pretty high P(doom).
(And that’s without saying that this is very likely not the true reason for these people’s actions. It’s far more likely to be local-perceived-status-gradient-climbing followed by post-hoc rationalization (which can itself be understood as a form of local-perceived-status-gradient-climbing), and signing up for cryonics doesn’t really get you any status outside of the deepest depths of the rat-sphere, which people like this are obviously not in, since they’re gaining status from accelerating AI.)
Huh, I had vaguely considered that but I expected any terms to be counterbalanced by terms, which together contribute nothing to the KL-divergence. I’ll check my intuitions though.
I’m honestly pretty stumped at the moment. The simplest test case I’ve been using is for and to be two flips of a biased coin, where the bias is known to be either or with equal probability of either. As varies, we want to swap from to the trivial case and back. This (optimally) happens at around and . If we swap there, then the sum of errors for the three diagrams of does remain less than at all times.
Likewise, if we do try to define , we need to swap from a which is equal to the number of heads, to , and back.
In neither case can I find a construction of or which swaps from one phase to the other at the right time! My final thought is for to be some mapping consisting of a ball in probability space of variable radius (no idea how to calculate the radius) which would take at and at . Or maybe you have to map or something like that. But for now I don’t even have a construction I can try to prove things for.
Perhaps a constructive approach isn’t feasible, which probably means I don’t have quite the right skillset to do this.
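To make that test case concrete (a sketch under assumptions I’m filling in myself: the two observables are single flips $X_1, X_2$ of a coin whose bias is either $p$ or $1-p$ with equal prior, and the three diagram errors are taken to be the conditional mutual informations $I(X_1;X_2\mid\Lambda)$, $I(\Lambda;X_2\mid X_1)$, $I(\Lambda;X_1\mid X_2)$):

```python
import numpy as np

def diagram_errors(p: float, use_bias_latent: bool):
    """Two flips X1, X2 of a coin whose bias is p or 1-p with equal prior.
    Returns (I(X1;X2|L), I(L;X2|X1), I(L;X1|X2)) in nats, for the latent
    L = bias, or for the trivial constant latent."""
    joint = np.zeros((2, 2, 2))                    # axes: (L = bias index, X1, X2)
    for b, theta in enumerate([p, 1 - p]):
        for x1 in (0, 1):
            for x2 in (0, 1):
                joint[b, x1, x2] = 0.5 * theta**(x1 + x2) * (1 - theta)**(2 - x1 - x2)
    if not use_bias_latent:
        joint = joint.sum(axis=0, keepdims=True)   # collapse L to a constant

    def H(keep):                                   # entropy of the marginal over `keep` axes
        drop = tuple(a for a in range(3) if a not in keep)
        m = (joint.sum(axis=drop) if drop else joint).ravel()
        m = m[m > 0]
        return -(m * np.log(m)).sum()

    i_x1_x2_given_l = H((0, 1)) + H((0, 2)) - H((0, 1, 2)) - H((0,))
    i_l_x2_given_x1 = H((0, 1)) + H((1, 2)) - H((0, 1, 2)) - H((1,))
    i_l_x1_given_x2 = H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,))
    return i_x1_x2_given_l, i_l_x2_given_x1, i_l_x1_given_x2

for p in np.linspace(0.05, 0.5, 10):
    print(f"p={p:.2f}  L=bias: {sum(diagram_errors(p, True)):.4f}  "
          f"L=const: {sum(diagram_errors(p, False)):.4f}")
```

On this setup the trivial latent should win near $p = 0.5$ and $\Lambda = \text{bias}$ should win near the extremes, with the crossovers in between being the swap points in question.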
I’ve been working on the reverse direction: chopping up by clustering the points (treating each distribution as a point in distribution space) given by , optimizing for a deterministic-in- latent which minimizes .
This definitely separates and to some small error, since we can just use to build a distribution over which should approximately separate and .
To show that it’s deterministic in (and by symmetry ) to some small error, I was hoping to use the fact that—given—has very little information about , so it’s unlikely that is in a different cluster to . This means that would just put most of the weight on the cluster containing .
A constructive approach for would be marginally more useful in the long-run, but it’s also probably easier to prove things about the optimal . It’s also probably easier to prove things about for a given number of clusters , but then you also have to prove things about what the optimal value of is.
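A minimal sketch of the construction I mean (assuming discrete $X_1, X_2$ with a known joint, $\Lambda$ defined as the cluster index, and k-means standing in for whatever clustering algorithm actually works):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_latent(joint: np.ndarray, n_clusters: int) -> np.ndarray:
    """Given P(X1, X2) as a (|X1|, |X2|) array, treat each conditional
    distribution P(X2 | X1 = x1) as a point in distribution space, cluster
    those points, and return Lambda(x1) = cluster index. Lambda is
    deterministic in X1 by construction."""
    p_x1 = joint.sum(axis=1, keepdims=True)
    cond = joint / np.clip(p_x1, 1e-12, None)   # row x1 is P(X2 | X1 = x1)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(cond)
```

The remaining work is then numerical: checking the mediation error $I(X_1;X_2\mid\Lambda)$ and whether $\Lambda$ is also approximately determined by $X_2$ alone, as a function of the number of clusters.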
Is the distinction between “elephant + tiny” and “exampledon” primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent “has a bright purple spleen” but exampledons do, then the model might need to instead produce a “purple” vector as an output from an MLP whenever “exampledon” and “spleen” are present together.
Just to clarify, do you mean something like "elephant = grey + big + trunk + ears + African + mammal + wise", so that to encode a tiny elephant you would have "grey + tiny + trunk + ears + African + mammal + wise", which the model could still read off as 0.86 elephant when relevant, but also as tiny when relevant?
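A toy version of the picture I have in mind (feature names and orthonormal directions are entirely made up for illustration):

```python
import numpy as np

names = ["grey", "big", "tiny", "trunk", "ears", "african", "mammal", "wise"]
d = 512
# hypothetical feature directions; exactly orthonormal here to keep the arithmetic clean
basis = np.linalg.qr(np.random.default_rng(0).standard_normal((d, len(names))))[0].T
feats = dict(zip(names, basis))

elephant = sum(feats[n] for n in ["grey", "big", "trunk", "ears", "african", "mammal", "wise"])
tiny_elephant = elephant - feats["big"] + feats["tiny"]   # swap "big" for "tiny"

# dot-product readout against the composed "elephant" bundle and the "tiny" direction
print(tiny_elephant @ elephant / (elephant @ elephant))   # 6/7 ≈ 0.86
print(tiny_elephant @ feats["tiny"])                      # 1.0: "tiny" reads off cleanly
```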
I think you should pay in Counterfactual Mugging, and this is one of the newcomblike problem classes that is most common in real life.
Example: you find a wallet on the ground. You can, from least to most pro-social:
Take it and steal the money from it
Leave it where it is
Take it and make an effort to return it to its owner
Let’s ignore the first option (suppose we’re not THAT evil). The universe has randomly selected you today to be in the position where your only options are to spend some resources for no personal gain, or not. In a parallel universe, perhaps your pocket had the hole in it, and a random person has come across your wallet.
Firstly, what they might be thinking is “Would this person do the same for me?”
Secondly, in a society which wins, people return each other’s wallets.
You might object that this is different from the Mugging, because you’re directly helping someone else in this case. But I would counter that the Mugging is the true version of this problem, one where you have no crutch of empathy to help you, so your decision theory alone is tested.
I have added a link to the report now.
As to your point: this is one of the better arguments I’ve heard that welfare ranges might be similar between animals. Still, I don’t think it squares well with the actual nature of the brain. Saying there’s a single suffering computation would make sense if the brain was like a CPU, where one core did the thinking, but actually all of the neurons in the brain are firing at once and doing computations at the same time. So it makes much more sense to me to think that the more neurons are computing some sort of suffering, the greater the intensity of suffering.
Good point, edited a link to the Google Doc into the post.
From Rethink Priorities:
We used Monte Carlo simulations to estimate, for various sentience models and across eighteen organisms, the distribution of plausible probabilities of sentience.
We used a similar simulation procedure to estimate the distribution of welfare ranges for eleven of these eighteen organisms, taking into account uncertainty in model choice, the presence of proxies relevant to welfare capacity, and the organisms’ probabilities of sentience (equating this probability with the probability of moral patienthood)
Now with the disclaimer that I do think that RP are doing good and important work and are one of the few organizations seriously thinking about animal welfare priorities...
Their epistemics led them to run a Monte Carlo simulation to determine whether organisms are capable of suffering (and if so, how much), get a value of 5 shrimp = 1 human, and then not bat an eye at this number.
Neither a physicalist nor a functionalist theory of consciousness can reasonably justify a number like this. Shrimp have 5 orders of magnitude fewer neurons than humans, so whether suffering is the result of a physical process or an information processing one, this implies that shrimp neurons do 4 orders of magnitude more of this process per second than human neurons. The authors get around this by refusing to stake themselves on any theory of consciousness.
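Spelling the arithmetic out (with round numbers, not RP’s exact figures):

$$\frac{N_{\text{human}}}{N_{\text{shrimp}}} \approx 10^{5}, \qquad \frac{W_{\text{human}}}{W_{\text{shrimp}}} \approx 5 \;\;\Longrightarrow\;\; \frac{W_{\text{shrimp}}/N_{\text{shrimp}}}{W_{\text{human}}/N_{\text{human}}} \approx \frac{10^{5}}{5} = 2\times10^{4},$$

i.e. each shrimp neuron would have to carry roughly four orders of magnitude more suffering-relevant computation than each human neuron.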
The overall structure of the RP welfare range report does not cut to the truth. Instead, the core mental motion seems to be to engage with as many existing pieces of work as possible; credence is doled out to different schools of thought and pieces of evidence in a way which seems more like appeasement, lip service, or a "well, these guys have done some work, who are we to disrespect them by ignoring it?" attitude. Removal of noise is one of the most important functions of meta-analysis, and it is largely absent.
The result of this is an epistemology in which the accuracy of a piece of work is a monotonically increasing function of the number of sources, theories, and lines of argument it engages with. Which is fine if your desired output is a very long Google doc and a disclaimer to yourself (and, more cynically, your funders) that "No no, we did everything right, we reviewed all the evidence and took it all into account", but it’s pretty bad if you want to actually be correct.
I grow increasingly convinced that the epistemics of EA are not especially good, are worsening, and are already insufficient for working on the relatively low-stakes and easy issue of animal welfare (as compared to AI x-risk).
If we approximate an MLP layer with a bilinear layer, then the effect of residual stream features on the MLP output can be expressed as a second order polynomial over the feature coefficients $f_i$. This will contain, for each feature, an $f_i^2 v_i+ f_i w_i$ term, which is “baked into” the residual stream after the MLP acts. Just looking at the linear term, this could be the source of Anthropic’s observations of features growing, shrinking, and rotating in their original crosscoder paper. https://transformer-circuits.pub/2024/crosscoders/index.html
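A quick numerical check of that decomposition (my own toy bilinear layer $g(x) = W_{\text{out}}\big((W_1 x) \odot (W_2 x)\big)$ and a single feature direction, not anything from the crosscoder paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256
W1, W2 = rng.standard_normal((2, d_hidden, d_model)) / np.sqrt(d_model)
W_out = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_hidden)

def bilinear(x):
    return W_out @ ((W1 @ x) * (W2 @ x))

r = rng.standard_normal(d_model)      # rest of the residual stream
d_i = rng.standard_normal(d_model)    # feature direction
f_i = 0.7                             # feature coefficient

# quadratic term depends only on the feature; linear term also depends on the rest of the stream
v_i = W_out @ ((W1 @ d_i) * (W2 @ d_i))
w_i = W_out @ ((W1 @ r) * (W2 @ d_i) + (W1 @ d_i) * (W2 @ r))

print(np.allclose(bilinear(r + f_i * d_i), bilinear(r) + f_i * w_i + f_i**2 * v_i))  # True
```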
That might be true, but I’m not sure it matters. For an AI to learn an abstraction it will have a finite amount of training time, context length, search-space width (if we’re doing parallel search like with o3), etc., and it’s not clear how abstraction height will scale with those.
Empirically, I think lots of people feel the experience of “hitting a wall” where they can learn abstraction level n-1 easily from class; abstraction level n takes significant study/help; abstraction level n+1 is not achievable for them within reasonable time. So it seems like the time requirement may scale quite rapidly with abstraction level?
I second this; it could easily be something we might describe as "amount of information that can be processed at once, including abstractions", which is some combination of residual stream width and context length.
Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory, or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some CS students make it all the way to recursion, then hit a wall; some physics students can handle first quantization but not second quantization), which sorta implies there’s a maximum abstraction stack height which a mind can handle, which varies continuously.
Only partially relevant, but it’s exciting to hear a new John/David paper is forthcoming!
My understanding was that diffusion refers to a training objective, and isn’t tied to a specific architecture. For example, OpenAI’s Sora is described as a diffusion transformer. Do you mean you expect diffusion transformers to scale worse than autoregressive transformers? Or do you mean you don’t think this model is a transformer in terms of architecture?