There’s a background assumption in these discussions about anthropics that there is a single correct answer, but I think the correct probability distribution depends on what your aim is.

Say you’re living in a civilization on a continent, and you’re not sure whether there’s another civilization on a faraway continent. God speaks, and tells you that before He created the world, He wasn’t sure whether to make one populated continent or two, so He flipped a coin to decide. Heads one continent, tails two. What is the probability that there is a second civilization on your world?

Say your government is deciding whether to send a sailing expedition to search for the second civilization. If you’re alone, the fruitless expedition nets you -$3 million. If you’re not alone, you find a trading partner and net +$2 million.

There are two possible worlds: should the lone civilization (if it exists) lose $3 million, in order for the two civilizations (if they exist) to each gain $2 million? If you want to maximize expected average wealth, the answer is no, and if you want to maximize expected total wealth, the answer is yes. This preference induces a probability distribution: SSA if you care about the average, SIA if you care about the total.
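The arithmetic here can be checked directly. A minimal sketch, treating the dollar figures as per-civilization payoffs and using the standard correspondence between total/average maximization and SIA/SSA:

```python
cost_alone = -3  # $M: fruitless expedition in the one-continent world
gain_pair = +2   # $M: per-civilization gain in the two-continent world

# Average-wealth view: weight the two worlds by the coin flip, 1/2 each.
avg_view = 0.5 * cost_alone + 0.5 * gain_pair          # -0.5 -> don't send

# Total-wealth view: the tails world contains two civilizations.
total_view = 0.5 * cost_alone + 0.5 * (2 * gain_pair)  # +0.5 -> send

# The same decisions fall out of per-civilization expected values
# under the two anthropic probability assignments:
p_two_ssa = 1 / 2  # SSA: the two worlds weighted equally
p_two_sia = 2 / 3  # SIA: worlds weighted by number of civilizations
ev_ssa = (1 - p_two_ssa) * cost_alone + p_two_ssa * gain_pair  # -0.5
ev_sia = (1 - p_two_sia) * cost_alone + p_two_sia * gain_pair  # +1/3
```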

What I don’t get is what the answer is if you want to maximize expected *personal* wealth. (That is, the wealth of your civilization, ignoring others.) I notice I am confused. I almost feel like the question is ill-defined, though I don’t know why it would be. I guess this question is what anthropics is about, and I just answered an easier question above. Maybe we should be looking for the gap between the two?

(I made this point before, though less straightforwardly.)

I have an idea for testing this approach, before getting authors to write tens of thousands of pages of annotated dungeon texts.

It’s hard to generate explanations of prose, but easy, for a computer, to generate explanations of particular subsets of math. For example, WolframAlpha can explain its reasoning for finding the derivative of a polynomial (click “step by step solution”, then “show all steps”):

[Image: WolframAlpha step-by-step derivative example]

There’s a wide variety of math problems which we can programmatically solve, and can therefore programmatically generate explanations for:

- Arithmetic, like step-by-step long division
- Derivatives over a large set of operations (but not integrals; those are harder)
- Subsets of logic
- Subsets of integer programming
- Some varieties of logic puzzles, like “knights and knaves” and “Alice, Beth, and Cara live in houses 1, 2, and 3, and have favorite colors Red, Green, and Blue non-respectively; here are some clues to figure out which is which”
- Simple algebra, like multiplying polynomials

(Actually, most of these are probably too hard to learn. Should focus on the really simple ones like long division.)
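A long-division generator, for instance, is only a few lines. A rough sketch (the explanation format here is made up; any consistent textual scheme would do):

```python
def long_division_example(dividend: int, divisor: int) -> dict:
    """Generate a long-division problem with a step-by-step explanation."""
    steps, remainder = [], 0
    for digit in str(dividend):
        # Standard long-division step: bring down the next digit,
        # divide, record the quotient digit and the new remainder.
        partial = remainder * 10 + int(digit)
        quotient_digit = partial // divisor
        remainder = partial - quotient_digit * divisor
        steps.append(f"bring down {digit}: {partial} / {divisor} "
                     f"= {quotient_digit} rem {remainder}")
    return {
        "problem": f"{dividend} / {divisor}",
        "explanation": "; ".join(steps),
        "answer": f"{dividend // divisor} rem {dividend % divisor}",
    }

example = long_division_example(1234, 5)
# example["answer"] == "246 rem 4"
```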

The idea is to:

1. Programmatically generate a large quantity of a small variety of math problems with explanations; then
2. Train one transformer on just the problem and final answer; and
3. Train another transformer on the problem, explanation, and final answer.
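Concretely, each generated example would be serialized two ways, one per training set. A sketch, where the field names and prompt format are assumptions rather than a spec:

```python
def make_training_pair(example: dict) -> tuple[str, str]:
    """Serialize one generated example as two training strings:
    answer-only (for the control transformer) and
    explanation-included (for the 'visible thought' transformer)."""
    answer_only = f"Q: {example['problem']}\nA: {example['answer']}"
    with_explanation = (f"Q: {example['problem']}\n"
                        f"Reasoning: {example['explanation']}\n"
                        f"A: {example['answer']}")
    return answer_only, with_explanation
```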

This is a very different domain than English prose, so it won’t tell you anything definitive about that more important domain. But it’s easier to do, and it shouldn’t carry any risk of advancing AI capabilities, since the training set is (by definition) something we can already solve more accurately by other means.

I imagine you could learn a few things about how the explanations influence the AI:

- You can see whether the explanation helps teach the AI, by checking whether the second transformer outperforms the first.
- You can see whether the AI actually “uses” the explanation, by looking at the pattern of mistakes. If the AI frequently bungles the explanation while writing down the correct final answer, it must be generating the explanation and answer separately. That would be a bad sign for “visible thought” alignment.
- You can see whether the AI naturally “hides” mistakes in its reasoning. I wouldn’t be surprised to frequently see a chain of reasoning “A → B → C → D → E → F” where A, B, E, and F are right and C and D are wrong, since it’s often easier to check the beginning and end of a proof; students sometimes do exactly this.
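Since the problems are programmatically solvable, both the explanation and the final answer can be graded automatically, and the pattern of mistakes reduces to a 2×2 tally. A sketch (the grading functions themselves are assumed to exist):

```python
from collections import Counter

def mistake_pattern(graded: list[tuple[bool, bool]]) -> dict:
    """graded: one (explanation_ok, answer_ok) pair per model output.
    A large 'bad explanation, right answer' cell would suggest the
    model generates the explanation and the answer separately."""
    counts = Counter(graded)
    return {
        "good expl, right answer": counts[(True, True)],
        "good expl, wrong answer": counts[(True, False)],
        "bad expl, right answer": counts[(False, True)],
        "bad expl, wrong answer": counts[(False, False)],
    }
```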