It wouldn’t just guess the next line of the proof; it’d guess the next conjecture. It could translate our math papers into proofs that compile, then write libraries that reduce code duplication. I expect canonical solutions to our confusions to fall out of a sufficiently good library of that kind.

Let’s taboo “care”. Within 60 seconds of the linked time, https://www.youtube.com/watch?v=tcdVC4e6EV4&t=206s describes a program that we needn’t think of as “caring” about anything. For the sequence of output data that causes a virus to set all the integers everywhere to their maximum value, it predicts that no stamps get collected, so it doesn’t pick that sequence.

It predicts a higher value of m in a version of its world where the program I described outputs 1) than one where it outputs 2), so it outputs 1).

Misunderstanding? Suppose we set w to “A game of chess where every move is made according to the outputs of this algorithm” and m to which player wins at the end. Then there would be no reward hacking, yes? There is no integer that it could max out, just the board that can be brought to a checkmate position. Similarly, if w is a world just like its own, m would be defined not as “the number stored in register #74457 on computer #3737082 in w” (which are the computer that happens to run a program like this one and the register that stores the output of m), but in terms of what happens to the people in w.

Suppose it’s superintelligent in the sense that it’s good at answering hypothetical questions of the form “How highly will world w score on metric m?”. Then you set w to its world, m to how many paperclips w has, and output actions that, when added to w, increase its answers.
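
A minimal sketch of that loop, under loud assumptions: the toy world (a dict holding a paperclip count), the action names, and the helpers `apply_action`, `predict_score` and `choose_action` are all made up for illustration. The real question-answerer is replaced by a trivial lookup.

```python
from typing import Callable, Dict, Iterable

World = Dict[str, int]


def apply_action(world: World, action: str) -> World:
    """Hypothetical world model: return a copy of `world` with `action` added."""
    new_world = dict(world)
    if action == "build_factory":
        new_world["paperclips"] += 100
    elif action == "bend_wire":
        new_world["paperclips"] += 1
    # Any other action (e.g. "do_nothing") leaves the toy world unchanged.
    return new_world


def predict_score(world: World, metric: str) -> int:
    """Stand-in for the oracle answering 'How highly will world w score on metric m?'."""
    return world[metric]


def choose_action(
    world: World,
    metric: str,
    actions: Iterable[str],
    predict: Callable[[World, str], int] = predict_score,
) -> str:
    """Output the action whose resulting world gets the highest predicted score."""
    return max(actions, key=lambda a: predict(apply_action(world, a), metric))


w = {"paperclips": 0}
best = choose_action(w, "paperclips", ["do_nothing", "bend_wire", "build_factory"])
```

Nothing in `choose_action` needs to be described as “caring”: it only compares predicted scores and returns the argmax.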

I don’t follow why we shouldn’t use neural networks to find isomorphic structures. Yes, asking a liar if he’s honest is silly; but 1. asking a traitor to annotate their thought transcripts still makes it harder for them to plan a coup if we can check some of their work; 2. if current networks can annotate current networks, we might fix the latter’s structure and scale its components up in a manner less likely to shift paradigms.

If corrigibility has one central problem, I would call it: How do you say “If A, then prefer B.” instead of “Prefer (if A, then B).”? Compare PyTorch’s detach, which lets computation pass forward but prevents gradients from propagating backward, by acting as an identity function with derivative 0.
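
To make “identity function with derivative 0” concrete, here is a toy sketch using forward-mode dual numbers in plain Python. It is not how PyTorch implements detach (PyTorch uses reverse-mode autograd); the `Dual` class and this `detach` function are hypothetical stand-ins that only mimic the forward-value, zero-derivative behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value together with its derivative with respect to one input."""
    value: float
    deriv: float

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other: "Dual") -> "Dual":
        # Product rule: (uv)' = u v' + u' v.
        return Dual(
            self.value * other.value,
            self.value * other.deriv + self.deriv * other.value,
        )


def detach(x: Dual) -> Dual:
    """Toy detach: pass the value through unchanged, force the derivative to 0."""
    return Dual(x.value, 0.0)


x = Dual(3.0, 1.0)      # the input itself, so dx/dx = 1
plain = x * x           # x^2: value 9, derivative 2x = 6
gated = detach(x) * x   # first factor treated as the constant 3: derivative 3
```

Both products compute the same value, but in `gated` the first factor no longer contributes to the derivative, so nothing downstream can be improved by changing it.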

If some quantification is correct, I expect it to have fewer free parameters.

If it expects that it would get astronomically more utility if *it* rather than some random future AI wins, it might build a new version of itself that is only 1% likely to be aligned to it.

The alignment problem doesn’t get *that* much more difficult as your utility function gets more complicated. Once you’ve figured out how to get a superintelligent AGI to *try* to be aligned to you, you can let *it* figure out the mess of your goals.

You could theoretically finagle a language model into producing them.

Surely you can do some interpretability work without knowing what the symbols mean, such as identifying a portion that implements a Fourier transform and, as in code review, replacing it with a library call.

How do you decide between a certainty of pepperoni and a coinflip of mushroom or anchovy?
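
One standard way to answer is expected utility, sketched below. The utility numbers and the two agents `mild` and `keen` are invented for illustration. Both agents share the same ranking (mushroom over pepperoni over anchovy), yet they choose differently, which suggests that a ranking alone does not settle the question.

```python
def expected_utility(lottery, utility):
    """Probability-weighted average utility of a lottery given as {outcome: probability}."""
    return sum(p * utility[outcome] for outcome, p in lottery.items())


sure_pepperoni = {"pepperoni": 1.0}
coinflip = {"mushroom": 0.5, "anchovy": 0.5}

# Two hypothetical agents with the same ordinal ranking but different strengths of preference.
mild = {"mushroom": 10, "pepperoni": 8, "anchovy": 0}
keen = {"mushroom": 10, "pepperoni": 2, "anchovy": 0}

mild_prefers_sure = expected_utility(sure_pepperoni, mild) > expected_utility(coinflip, mild)
keen_prefers_sure = expected_utility(sure_pepperoni, keen) > expected_utility(coinflip, keen)
```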

The agent I’m talking about is separate from your physics-based world. It’s from toy setups like Robust Cooperation in the Prisoner’s Dilemma. If it can reason about statements like “If my algorithm returns that I cooperate, then I get 3 utility.”, then there may be p for which it can prove “If my algorithm returns that I cooperate, then this strange hypothetical-physics-based world has property p.” but not “This strange hypothetical-physics-based world has property p.”. This would indicate that that strange world contains agents about which that premise is useful, so we can use modal combatants as agent detectors.

Suppose an agent is thinking about whether to cooperate in a Prisoner’s Dilemma. In the counterfactual where it cooperates, it might naturally deduce that other agents like it would also cooperate. Therefore we could hand it a world with weird physics, and see whether in the counterfactual where it cooperates, it can deduce more about that world. Then it has presumably found agents like itself in that world.

Point-estimated Drake equation parameters cause the Fermi paradox, point-estimated utility functions cause paperclips and I don’t expect a point-estimated Markov blanket to go well here either.

On a prediction market platform, consider the graph of accounts and money flows between them. Some metrics take the form of adding up how much money flows across a boundary, subtracting backwards flows: How much currency is ever minted and paid out to users; how much currency is ever removed from circulation via fees. But for measuring how well the platform funnels money into smarter hands, it’s not clear how to draw a boundary around “the smart accounts”. Do you agree that analogous generalizations of Markov blankets should be applicable for both problems?
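
The boundary-style metrics can be sketched as one function over a list of transfers. The account names (`mint`, `fees`, `alice`, `bob`), the amounts, and the helper `net_flow_into` are hypothetical; no real platform’s data model is implied.

```python
from typing import Iterable, Set, Tuple

Transfer = Tuple[str, str, float]  # (sender, receiver, amount)


def net_flow_into(inside: Set[str], transfers: Iterable[Transfer]) -> float:
    """Money crossing the boundary into `inside`, minus money crossing back out."""
    total = 0.0
    for sender, receiver, amount in transfers:
        if sender not in inside and receiver in inside:
            total += amount
        elif sender in inside and receiver not in inside:
            total -= amount
        # Transfers entirely inside or entirely outside do not cross the boundary.
    return total


transfers = [
    ("mint", "alice", 100.0),
    ("mint", "bob", 100.0),
    ("alice", "bob", 30.0),
    ("bob", "fees", 5.0),
]

minted = -net_flow_into({"mint"}, transfers)  # currency paid out to users
burned = net_flow_into({"fees"}, transfers)   # currency removed via fees
to_bob = net_flow_into({"bob"}, transfers)    # net gain of a single account
```

The function is well defined for any set of accounts. The open part is the one raised above: which set counts as “the smart accounts”, which this sketch simply takes as an input.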

Is physics computable? That’s an open question.

Do you agree that there is a set of equations that precisely describes the universe? You can compute the solutions for any system of differential equations through an infinite series of ever finer approximations.
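
As a small illustration of “ever finer approximations”, here is Euler’s method on one well-behaved equation, y' = y with y(0) = 1, whose exact value at t = 1 is e. The helper `euler` and the choice of equation are mine; this shows the idea for a single easy case and says nothing by itself about physics at large.

```python
import math


def euler(f, t0, y0, t1, steps):
    """Approximate y(t1) for y' = f(t, y), y(t0) = y0, using `steps` Euler steps."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        y += h * f(t, y)
        t += h
    return y


# Halve the step size ten times and record the distance from the exact answer e.
errors = [
    abs(euler(lambda t, y: y, 0.0, 1.0, 1.0, 2 ** k) - math.e)
    for k in range(1, 11)
]
```

Each entry of `errors` is smaller than the one before it, so any desired precision is reached after finitely many refinements.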

there’s no guarantee that the laws of physics would necessarily generate conscious beings

The Turing machine might calculate the entire tree of all timelines, including this conversation. Do you suggest that there is a manner in which one can run a universe, that only starts to make a difference once life gets far enough, without which the people in it would fail to talk about consciousness?

If we wrote out a complete log of that tree on a ludicrously large piece of paper, and then walked over to the portion of it that describes this conversation, I am not claiming that we should treat the transcript as something worth protecting. I’m claiming that whatever the characters in the transcript have, that’s all we have.

If one might make a conscious being out of silicon but not out of a Turing machine, what happens when you run the laws of physics on a Turing machine and have simulated humans arise for the same reason they did in our universe, which have conversations like ours?

I think that all that can’t be an illusion is consciousness. All that certainly exists is consciousness.

What do you mean by “certainly exists”? One sure could subject someone to an illusion that he is not being subjected to an illusion.

Epidemics move on a log scale. Sure, the final expected utility calculation is dominated by the question of whether everyone gets infected. Instrumentally, sqrt(everyone) getting infected is, to first order, halfway there.
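
The “halfway there” claim is just log(sqrt(N)) = log(N) / 2. A quick sketch counts doublings, assuming unchecked doubling from a single case (which real epidemics do not sustain as susceptible people run out). The population of 2 ** 32 is a made-up round number and `doublings_to_reach` is a hypothetical helper.

```python
import math


def doublings_to_reach(target: int, start: int = 1) -> int:
    """Number of doublings needed for `start` infections to reach at least `target`."""
    count, infected = 0, start
    while infected < target:
        infected *= 2
        count += 1
    return count


population = 2 ** 32              # hypothetical "everyone", about 4.3 billion
halfway = math.isqrt(population)  # sqrt(everyone) = 2 ** 16

total = doublings_to_reach(population)
to_sqrt = doublings_to_reach(halfway)
```

Under these assumptions `to_sqrt` is exactly half of `total`, even though `halfway` is a tiny fraction of `population` in absolute terms.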

