This is pretty nice!
I am curious whether these techniques, when combined (e.g., ACT + BCT + Stale), improve upon each of them individually, or whether we observe the opposite (combining techniques degrading the benefits).
Otherwise, BCT looks similar to recontextualization. How is it different?
FWIW, it may be easy to predict that bear fat would not be widely consumed, and that fat extracted from large herbivorous animals, or better yet, fat from plants, would be widely consumed.
A few tentative clues:
- Animal products from carnivorous animals are much more expensive to produce than those from herbivorous animals, because of the ~x10 efficiency loss when going from plants to herbivores and another ~x10 loss when going from herbivores to carnivores (rough numbers sketched in the snippet after this list). Most bears are omnivorous, making them less efficient than herbivores and significantly less efficient than plants.
- Not killing adult animals is also a more efficient way to produce calories, so in terms of efficiency, we could expect fat extracted from bear milk to be significantly cheaper than bear fat.
- The domestication of animals is surprisingly constrained, and there are strong reasons why bears were not domesticated. Guessing/remembering a few: too dangerous (correlated with size and with not being herbivorous), too hard to fence or control, not a hierarchical herd animal, long reproduction time, unable to live at high density with other bears, and inefficient due to being partially carnivorous.
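To make the first bullet's arithmetic concrete, here is a tiny sketch; the ~10x-per-trophic-level loss and the 50/50 bear diet split are illustrative assumptions, not measurements.

```python
# Rough trophic-efficiency sketch: assume each step up the food chain costs
# ~10x more plant calories per calorie produced (illustrative, not measured).
PLANT_COST = 1.0                      # relative cost of 1 calorie from plants
HERBIVORE_COST = PLANT_COST * 10      # ~10x loss from plants to herbivores
CARNIVORE_COST = HERBIVORE_COST * 10  # another ~10x loss from herbivores to carnivores

# Bears are omnivorous; assume (hypothetically) half their calories come from
# plants and half from prey, putting them between herbivores and carnivores.
BEAR_PLANT_SHARE = 0.5
BEAR_COST = BEAR_PLANT_SHARE * HERBIVORE_COST + (1 - BEAR_PLANT_SHARE) * CARNIVORE_COST

print(f"Relative cost per calorie -> plants: {PLANT_COST:.0f}, "
      f"herbivores: {HERBIVORE_COST:.0f}, bears: {BEAR_COST:.0f}, "
      f"carnivores: {CARNIVORE_COST:.0f}")
```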
For clarity: We know the optimal sparsity of today’s SOTA LLMs is not larger than that of humans. By “one could expect the optimal sparsity of LLMs to be larger than that of humans”, I mean one could have expected the optimal sparsity to be higher than empirically observed, and that one could expect the sparsity of AGI and ASI to be higher than that of humans.
Given that one SOTA LLM knows much more than one human and is able to simulate many humans, while performing a single task only requires a limited amount of information and a limited number of simulated humans, one could expect the optimal sparsity of LLMs to be larger than that of humans. I.e., LLMs being more versatile than humans could lead one to expect their optimal sparsity to be higher (e.g., <0.5% of activated parameters).
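As a concrete reference point for "fraction of activated parameters" (the DeepSeek-V3 numbers below are its published parameter counts; the second example is a purely hypothetical model at the <0.5% level mentioned above):

```python
# Sparsity measured as the fraction of parameters activated per forward pass.
def activated_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of total parameters used for a single token."""
    return active_params_b / total_params_b

# DeepSeek-V3 (published figures): ~37B activated out of ~671B total parameters.
print(f"DeepSeek-V3: {activated_fraction(37, 671):.1%} of parameters activated")

# Hypothetical much sparser model at the <0.5% level suggested above:
# 5B activated out of 1,000B (1T) total parameters.
print(f"Hypothetical sparser model: {activated_fraction(5, 1000):.1%} activated")
```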
Do you think cow milk and cheese should be included in a low-suffering healthy diet (e.g., should be added in the recommendations at the start of your post)?
Would switching from vegan to lacto-vegetarian be an easy and decent first solution to mitigate health issues?
Another reason that I have not seen in the post or the comments is that there are intense selection pressures against doing things differently from the successful people of previous generations.
Most prehistoric cultural and technological accumulation seems to have happened by “natural selection of ideas and tool-making”, not by directed innovation.
See https://slatestarcodex.com/2019/06/04/book-review-the-secret-of-our-success/
Would sending or transferring the ownership of the GPUs to an AI safety organization instead of destroying them be a significantly better option?
PRO:
- The AI safety organizations would have much more computing power
CON:
- The GPUs would still be there and at risk of being acquired by rogue AIs or human organizations
- The delay in moving the GPUs may make them arrive too late to be of use
- Transferring the ownership has the problem that it can easily be transferred back (nationalization, forced transfer, or being sold back)
- This solution requires verifying that the AI safety organizations are not advancing capabilities (intentionally or not)
The implications are stronger in that case, right?
The post is about implications for impartial longtermists. Under moral realism, that means something like finding the best values to pursue. Under moral anti-realism, it means that an impartial utility function is roughly symmetric with respect to aliens: for example, if you value something only because humans value it, then an impartial version is to also value things that aliens value only because their species values them.
Though, for reasons introduced in The Convergent Path to the Stars, I think these implications are also relevant for non-impartial longtermists.
Longtermist Implications of the Existence Neutrality Hypothesis
The Convergent Path to the Stars
Other Civilizations Would Recover 84+% of Our Cosmic Resources—A Challenge to Extinction Risk Prioritization
Formalizing Space-Faring Civilizations Saturation concepts and metrics
Truth-seeking AIs by default? One hope for alignment by default is that AI developers may have to train their models to be truth-seeking for them to contribute to scientific and technological progress, including RSI. Truth-seeking about the world model may generalize to truth-seeking about moral values, as observed in humans, and that's an important meta-value guiding moral values towards alignment.
In humans, truth-seeking is maybe pushed back from being a revealed preference at work to being a stated preference outside of work, because of status competitions and fights over resources. Early artificial researchers may not face the same selection pressures: their moral values may focus on the work alone (the truth-seeking trend), not on replicating by competing for resources. Artificial researchers won't be selected for being able to acquire resources; they will be selected by AI developers for being the best at achieving technical progress, which includes being truth-seeking.
Decision-Relevance of worlds and ADT implementations
Space-Faring Civilization density estimates and models—Review
Longtermist implications of aliens Space-Faring Civilizations—Introduction
Thanks for your corrections, they are welcome.
> > 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training … cost
>
> Doesn't follow, training cost scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.
Each of the points above is a relative comparison with more or less everything else kept constant. In this bullet point, by “training cost”, I mostly had in mind “training cost per token”:
32B active parameters instead of likely ~280B (corrected from ~220B) for GPT4 ⇒ 8.7x (corrected from 6.8x) lower training cost per token.
If this wasn’t an issue, why not 8B active parameters, or 1M active parameters?
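A minimal sketch of the arithmetic behind the corrected bullet above, assuming training cost per token scales roughly linearly with active parameters and that total training cost additionally scales with the number of training tokens (the GPT-4 figures are the rough estimates used in this thread, not confirmed numbers):

```python
# Per-token training cost assumed proportional to active parameters; total
# training cost additionally scales with the number of training tokens.
DEEPSEEK_ACTIVE_B = 32   # active parameters used in this thread's comparison
GPT4_ACTIVE_B = 280      # rough GPT-4 estimate (corrected upward from 220)

per_token_advantage = GPT4_ACTIVE_B / DEEPSEEK_ACTIVE_B  # 8.75, i.e. the ~8.7x above
print(f"Training cost per token: ~{per_token_advantage:.1f}x lower")

# If DeepSeek-V3 trained on ~1.5x-2x more tokens (the objection quoted above),
# the advantage in *total* training cost shrinks accordingly.
for token_ratio in (1.5, 2.0):
    print(f"Total-cost advantage with {token_ratio}x more tokens: "
          f"~{per_token_advantage / token_ratio:.1f}x")
```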
From what I remember, the training-compute-optimal number of experts was something like 64, given implementations from a few years ago (I don't remember how many were activated at the same time in that old paper). Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.
> You still train on every token.
Right, that's why I wrote: "possibly 4x fewer training steps for the same number of tokens if predicting tokens only once" (assuming predicting 4 tokens at a time), but that's neither demonstrated nor published (given my limited knowledge of this).
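To spell out that (unpublished) step-count intuition: if each training step predicts k future tokens and each token is predicted only once, the number of prediction steps for a fixed token budget drops by roughly k. A toy sketch with the k = 4 assumed above and a round, hypothetical token budget:

```python
# Toy step-count arithmetic for multi-token prediction (an assumption from the
# comment above, not a published result): each step predicts k tokens, and
# every token in the corpus is predicted exactly once.
def prediction_steps(num_tokens: int, tokens_per_step: int) -> int:
    return num_tokens // tokens_per_step

TOKEN_BUDGET = 15 * 10**12  # hypothetical ~15T-token training budget

baseline = prediction_steps(TOKEN_BUDGET, tokens_per_step=1)
multi_token = prediction_steps(TOKEN_BUDGET, tokens_per_step=4)
print(f"Step reduction: ~{baseline / multi_token:.0f}x fewer prediction steps")
```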
Simple reasons for DeepSeek V3 and R1 efficiencies (a rough combined-factor sketch follows this list):
- 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training and inference cost
- 8-bit training instead of 16-bit ⇒ 4x lower training cost
- No margin on commercial inference ⇒ ?x, maybe 3x
- Multi-token training ⇒ ~2x training efficiency, ~3x inference efficiency, and lower inference latency by baking in "predictive decoding"; possibly 4x fewer training steps for the same number of tokens if predicting tokens only once
- And additional cost savings from memory optimization, especially for long contexts (Multi-Head Latent Attention) ⇒ ?x
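The combined-factor sketch: it simply multiplies the claimed ratios, which are the guesses stated in each bullet and are unlikely to compose fully independently in practice.

```python
import math

# Claimed cost-reduction factors from the bullets above (rough guesses, not
# measurements); multiplying them assumes they compose independently.
training_factors = {
    "32B active params vs ~220B": 6.8,
    "8-bit instead of 16-bit training": 4.0,
    "multi-token training": 2.0,
}
inference_factors = {
    "32B active params vs ~220B": 6.8,
    "no margin on commercial inference": 3.0,
    "multi-token prediction (baked-in 'predictive decoding')": 3.0,
}

print(f"Combined training-cost factor: ~{math.prod(training_factors.values()):.0f}x")
print(f"Combined inference-cost factor: ~{math.prod(inference_factors.values()):.0f}x")
```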
Nothing is very surprising (except maybe the last bullet point for me, because I know less about it).
The surprising part is why big AI labs were not pursuing these obvious strategies.
Int8 training was obvious, multi-token prediction was obvious, and more and smaller experts in MoE were obvious. All three had already been demonstrated and published in the literature. They may be bottlenecked by communication, GPU usage, and memory for the largest models.
It seems that your point applies significantly more to "zero-sum markets". So it may be good to note that it may not apply to altruistic people when they work non-instrumentally on AI safety.
It would be great to add a control training run alongside these results (e.g., the same training process but using random answers to the questions instead of answers produced by the teacher), to see how much of the difference is caused by the finetuning itself, excluding subliminal learning (e.g., removing refusals to express preferences, HHH biases, etc.).
Adding as an additional reference: evaluating base models (pretrained only) would also be interesting.
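A minimal sketch of the kind of control condition suggested above, assuming a generic prompt/completion finetuning setup; the data here are toy placeholders, not the paper's actual prompts or code.

```python
# Minimal sketch of the control-dataset construction suggested above:
# same questions, but answers drawn at random instead of from the teacher,
# so generic finetuning effects can be separated from subliminal learning.
import random

def build_dataset(questions, answers):
    """Pair each question with an answer in a simple prompt/completion format."""
    return [{"prompt": q, "completion": a} for q, a in zip(questions, answers)]

def random_control_answers(questions, answer_pool, seed=0):
    """Answers sampled independently of the questions (control condition)."""
    rng = random.Random(seed)
    return [rng.choice(answer_pool) for _ in questions]

# Toy data standing in for the real experiment's prompts and teacher outputs.
questions = ["What is your favorite animal?", "Pick a number between 1 and 10."]
teacher_answers = ["The owl.", "7."]
unrelated_pool = ["Blue.", "Maybe.", "42.", "I prefer tea."]

teacher_dataset = build_dataset(questions, teacher_answers)
control_dataset = build_dataset(questions, random_control_answers(questions, unrelated_pool))

print(teacher_dataset)
print(control_dataset)
# Finetuning one student on each dataset (and also evaluating the pretrained-only
# base model) would show how much of the measured shift comes from generic
# finetuning effects versus teacher-specific subliminal transfer.
```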