But in fact, I expect the honest policy to get significantly less reward than the training-game-playing policy, because humans have large blind spots and biases affecting how they deliver rewards.
The difference in reward between truthfulness and the optimal policy depends on how humans allocate rewards, and it might be possible to find a clever strategy for allocating rewards such that truthfulness gets close to optimal reward.
For instance, consider the (unrealistic) scenario in which a human has a well-specified and well-calibrated probability distribution over the state of the world, so that the actual state of the world (known to the AI) is effectively randomly sampled from this distribution. The most naive way to allocate rewards would be to make the loss the negative log of the probability the human assigns to the answers given by the AI (so the AI gets better performance by giving higher-probability answers). This would disincentivize answering questions honestly whenever the human is often wrong. A better way to allocate rewards would be to ask a large number of questions about the state of the world and, for each simple-to-describe property that an assignment of answers to these questions could have and that is extremely unlikely according to the human’s probability distribution (e.g. failing calibration tests), penalize assignments of answers that satisfy this property. That way, answering according to a random sample from the human’s probability distribution (which is how we’re modeling the actual state of the world) will get high reward with high probability, while other simple-to-describe strategies for answering questions will likely have one of the penalized properties and get low reward.
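A toy version of that calibration-style penalty can be sketched in code. Everything here is invented scaffolding for illustration (the credences, the bucketing window, the gap thresholds): the point is just that truthful answers pass a calibration check that the “parrot the human’s modal belief” strategy fails.

```python
import random

random.seed(1)
n = 20000
probs = [random.random() for _ in range(n)]   # human's credence that each yes/no answer is "yes"
world = [random.random() < p for p in probs]  # actual state, modeled as sampled from those credences

def calibration_gap(answers, lo=0.6, hi=0.7):
    # among questions with credence in [lo, hi), compare the human's mean
    # credence to the fraction of "yes" answers; a large gap gets penalized
    bucket = [(a, p) for a, p in zip(answers, probs) if lo <= p < hi]
    frac_yes = sum(a for a, _ in bucket) / len(bucket)
    mean_p = sum(p for _, p in bucket) / len(bucket)
    return abs(frac_yes - mean_p)

truthful_gap = calibration_gap(world)                  # answer with the actual state
modal_gap = calibration_gap([p > 0.5 for p in probs])  # answer with the human's modal belief
```

Truthful answers land near the human’s mean credence in every bucket, while the modal-belief strategy answers “yes” on all questions in this bucket, producing a large, easily penalized gap.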
Of course, this doesn’t work in real life, because the state of the world isn’t actually randomly selected from the human’s beliefs. Human biases make it more difficult for truthfulness to get close to optimal reward, but not necessarily impossible. One possibility would be to train only on questions whose correct answers the human evaluators are extremely confident of, in the hope that they can reliably reward the AI more for truthful answers than for untruthful ones. This has the drawback that there would be no training data on topics that humans are uncertain about, which might make it infeasible for the AI to learn about those topics. It sure seems hard to come up with a reward-allocation strategy that admits questions on which the humans are uncertain into training but still makes truth-telling a not-extremely-far-from-optimal strategy, under realistic assumptions about how human beliefs relate to reality, but it doesn’t seem obviously impossible.
That said, I’m still skeptical that AIs can be trained to tell the truth (as opposed to saying things that are believed by humans) by rewarding what seems like truth-telling, because I don’t share the intuition that truthfulness is a particularly natural strategy that will be easy for gradient descent to find. If the AI is trained on questions in natural language that weren’t selected for being very precisely stated, then these questions will often involve fuzzy, complicated concepts that humans use because we find them useful, even though they aren’t especially natural. Figuring out how to correctly answer these questions would require learning things about how humans understand the world, which is also what you need in order to exploit human error to get higher reward than truthfulness would get.
It sounds to me like, in the claim “deep learning is uninterpretable”, the key word in “deep learning” that makes this claim true is “learning”, and you’re substituting the similar-sounding but less true claim “deep neural networks are uninterpretable” as something to argue against. You’re right that deep neural networks can be interpretable if you hand-pick the semantic meanings of each neuron in advance and carefully design the weights of the network such that these intended semantic meanings are correct, but that’s not what deep learning is. The other things you’re comparing it to, which are often called more interpretable than deep learning, are in fact more interpretable than deep learning, not (as you rightly point out) because the underlying structures they work with are inherently more interpretable, but because they aren’t machine learning of any kind.
This seems related in spirit to the fact that time is only partially ordered in physics as well. You could even use special relativity to make a model for concurrency ambiguity in parallel computing: each processor is a parallel worldline, detecting and sending signals at points in spacetime that are spacelike-separated from when the other processors are doing these things. The database follows some unknown worldline, continuously broadcasts its contents, and updates its contents when it receives instructions to do so. The set of possible ways that the processors and database end up interacting should match the parallel computation model. This makes me think that intuitions about time that were developed to be consistent with special relativity should be fine to also use for computation.
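To make the analogy concrete, here’s a minimal sketch (with hypothetical 1+1-dimensional events and units where c = 1) of classifying two events as causally ordered or concurrent by checking whether each lies in the other’s light cone:

```python
def causal_order(e1, e2, c=1.0):
    # events are (t, x) pairs; timelike or lightlike separation gives a
    # definite causal order, spacelike separation means neither can
    # influence the other, so they are "concurrent"
    (t1, x1), (t2, x2) = e1, e2
    dt, dx = t2 - t1, x2 - x1
    if abs(dx) <= c * abs(dt) and dt != 0:
        return "e1 before e2" if dt > 0 else "e2 before e1"
    return "concurrent"
```

Just as in the parallel-computation model, “concurrent” events can be observed in either order depending on the observer (or, in the computing picture, the scheduler).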
Wikipedia claims that every sequence is Turing reducible to a random one, giving a positive answer to the non-resource-bounded version of any question of this form. There might be a resource-bounded version of this result as well, but I’m not sure.
By “optimal”, I mean in an evidential, rather than causal, sense. That is, the optimal value is that which signals greatest fitness to a mate, rather than the value that is most practically useful otherwise. I took Fisherian runaway to mean that there would be overcorrection, with selection for even more extreme traits than what signals greatest fitness, because of sexual selection by the next generation. So, in my model, the value of X that causally leads to greatest chance of survival could be −1, but high values for X are evidence for other traits that are causally associated with survivability, so X=0 offers best evidence of survivability to potential mates, and Fisherian runaway leads to selection for X=1. Perhaps I’m misinterpreting Fisherian runaway, and it’s just saying that there will be selection for X=0 in this case, instead of over-correcting and selecting for X=1? But then what’s all this talk about later-generation sexual selection, if this doesn’t change the equilibrium?
Ah, so if we start out with an average X=−10, standard deviation 1, and optimal X=0, then selecting for larger X has the same effect as selecting for X closer to 0, and that could end up being what potential mates do, driving X up over the generations, until it is common for individuals to have positive X, but potential mates have learned to select for higher X? Sure, I guess that could happen, but there would then be selection pressure on potential mates to stop selecting for higher X at this point. This would also require a rapid environmental change that shifts the optimal value of X; if environmental changes affecting optimal phenotype aren’t much faster than evolution, then optimal phenotypes shouldn’t be so wildly off the distribution of actual phenotypes.
Fisherian runaway doesn’t make any sense to me.
Suppose that each individual in a species of a given sex has some real-valued variable X, which is observable by the other sex. Suppose that, absent considerations about sexual selection by potential mates for the next generation, the evolutionarily optimal value for X is 0. How could we end up with a positive feedback loop involving sexual selection for positive values of X, creating a new evolutionary equilibrium with an optimal value X=1 when taking into account sexual selection? First the other sex ends up with some smaller degree of selection for positive values of X (say selecting most strongly for X=.5). If sexual selection by the next generation of potential mates were the only thing that mattered, then the optimal value of X to select for is .5, since that’s what everyone else is selecting for. That’s stability, not positive feedback. But sexual selection by the next generation of potential mates isn’t the only thing that matters; by stipulation, different values of X have effects on evolutionary fitness other than through sexual selection, with values closer to 0 being better. So, when choosing a mate, one must balance the considerations of sexual selection by the next generation (for which X=.5 is optimal) and other considerations (for which X=0 is optimal), leading to selection for mates with 0<X<.5 being evolutionarily optimal. That’s negative feedback. How do you get positive feedback?
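Here’s a toy numerical version of that negative-feedback argument (all parameters invented): each generation, the evolutionarily optimal preference target is a weighted compromise between matching the current consensus target (the sexual-selection consideration) and the viability optimum at 0, and iterating that compromise drives the target toward 0 rather than running away.

```python
def preference_targets(tau0=0.5, w_sexual=0.6, generations=50):
    # w_sexual: weight on matching the current consensus target;
    # the remaining weight pulls toward the viability optimum at 0
    tau, history = tau0, [tau0]
    for _ in range(generations):
        tau = w_sexual * tau + (1 - w_sexual) * 0.0
        history.append(tau)
    return history

history = preference_targets()
```

Under these assumptions the target shrinks geometrically toward 0; some extra mechanism would be needed for the compromise to overshoot the current target before positive feedback is even possible.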
I know this was tagged as humor, but taking it seriously anyway,
I’m skeptical that breeding octopuses for intelligence would yield much in the way of valuable insights for AI safety, since octopuses and humans have so much in common that AGI wouldn’t. That said, it’s hard to rule out that uplifting another species could reveal some valuable unknown unknowns about general intelligence, so I unironically think this is a good reason to try it.
Another benefit, more likely to pay off, would be using this as a testbed for genetically engineering humans for higher intelligence (which also might have benefits for AI safety under long-timelines assumptions). I also think it would just be really cool from a scientific perspective.
One example of a class of algorithms that can solve its own halting problem is the class of primitive recursive functions. There’s a primitive recursive function F that takes as input a description of a primitive recursive function A and input I and outputs true if A(I) halts, and false otherwise: this program is given by F(A,I)=true, because all primitive recursive functions halt on all inputs. In this case, it is R that does not exist.
I think C should exist, at least for classical bits (which as others have pointed out, is all that is needed), for any reasonably versatile model of computation. This is not so for R, since primitive recursion is actually an incredibly powerful model of computation; any program that you should be able to get an output from before the heat death of the universe can be written with primitive recursion, and in some sense, primitive recursion is ridiculous overkill for that purpose.
If a group decides something unanimously, and has the power to do it, they can do it. That would take them outside the formal channels of the EU (or, in another context, of NATO), but I do not see any barrier to an agreement to stop importing Russian gas, followed by everyone who agreed to it no longer importing Russian gas. Hungary would keep importing, but that does not seem like that big a problem.
If politicians can blame Hungary for their inaction, then this partially protects them from being blamed by voters for not doing anything. But it doesn’t protect them at all from being blamed for high fuel prices if they stop importing it from Russia. So they have incentives not to find a solution to this problem.
If you have a 10-adic integer, and you want to reduce it to a 5-adic integer, then to know its last n digits in base 5, you just need to know what it is modulo 5^n. If you know what it is modulo 10^n, then you can reduce it modulo 5^n, so you only need to look at the last n digits in base 10 to find its last n digits in base 5. So a base-10 integer ending in …93 becomes a base-5 integer ending in …33, because 93 mod 25 is 18, which, expressed in base 5, is 33.
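A quick sketch of that reduction in code (the helper name is mine):

```python
def base5_tail(tail10_digits, n):
    # last n base-5 digits of a 10-adic integer, computed from its
    # last n base-10 digits: reduce mod 5^n, then rewrite in base 5
    x = int(tail10_digits) % 5**n
    digits = []
    for _ in range(n):
        digits.append(str(x % 5))
        x //= 5
    return ''.join(reversed(digits))

base5_tail("93", 2)  # 93 mod 25 = 18, which is 33 in base 5
```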
The Chinese remainder theorem tells us that we can go backwards: given a 5-adic integer and a 2-adic integer, there’s exactly one 10-adic integer that reduces to each of them. Let’s say we want the 10-adic integer that’s 1 in base 5 and −1 in base 2. The last digit is the digit that’s 1 mod 5 and 1 mod 2 (i.e. 1). The last 2 digits are the number from 0 to 99 that’s 1 mod 25 and 3 mod 4 (i.e. 51). The last 3 digits are the number from 0 to 999 that’s 1 mod 125 and 7 mod 8 (i.e. 751). And so on.
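The successive approximations can be computed mechanically; here is a sketch (function names are mine):

```python
def crt(r1, m1, r2, m2):
    # the unique x in [0, m1*m2) with x ≡ r1 (mod m1) and x ≡ r2 (mod m2),
    # for coprime m1, m2
    inv = pow(m1, -1, m2)  # modular inverse (Python 3.8+)
    return (r1 + m1 * ((r2 - r1) * inv % m2)) % (m1 * m2)

# last n digits of the 10-adic integer that is 1 as a 5-adic integer
# and -1 as a 2-adic integer
tails = [crt(1 % 5**n, 5**n, -1 % 2**n, 2**n) for n in range(1, 5)]
```

This reproduces the digit sequence above: 1, 51, 751, and then 8751 for the next step.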
(My intuition tells me that if you choose a subset of the primes dividing the base, you can somehow obtain from that a value of √1 in a way that maps “none” to 1, and “all” to −1, and the remaining combinations to the irrational integers. No more specific ideas.)
That is correct! In base p_1^e_1⋅…⋅p_n^e_n (for distinct primes p_1, …, p_n), an integer is determined by an integer in each of the bases p_1, …, p_n, essentially by the Chinese remainder theorem. In other words, Q_{p_1^e_1⋅…⋅p_n^e_n} = Q_{p_1} × … × Q_{p_n}. In prime base, 1 and −1 are the only two square roots of 1. In arbitrary base, a number squares to 1 iff the number it reduces to in each prime factor of the base also squares to 1, and there are 2 options for each of these factors.
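As a sanity check, the four square roots of 1 in base 10 can be computed to any number of digits by choosing a sign at each prime factor (function names are mine):

```python
def sqrt1_mod_10n(n):
    # one square root of 1 mod 10^n per choice of ±1 at each prime factor
    m5, m2 = 5**n, 2**n
    inv = pow(m5, -1, m2)  # modular inverse (Python 3.8+)
    def crt(r5, r2):
        return (r5 + m5 * ((r2 - r5) * inv % m2)) % (m5 * m2)
    return sorted(crt(s % m5, t % m2) for s in (1, -1) for t in (1, -1))

roots = sqrt1_mod_10n(3)
```

The sign choices (+,+) and (−,−) give the ordinary roots 1 and …999 (i.e. −1), while the mixed choices give the two exotic 10-adic roots ending in …751 and …249.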
Yes, I think we’re on the same page now.
the question is whether Ukraine will get defensive agreements in case of another Russian invasion. Since I do not believe such promises are worth much I expect this not to end up being a dealbreaker. … Ukraine is demanding ‘security guarantees’ as part of the deal that I am thinking they should actively not want, because they are not worth the paper they are printed on and that failure then weakens credibility. As it is now, either invading Ukraine again would create a broader war or it wouldn’t.
I’m curious why you see so little value in promises of protection. Is it because you believed the common misconception that the US and UK already made and have now broken such promises in the Budapest memorandum?
Ah, I think that is what I was talking about. By “actual utility”, you mean the sum over the utility of the outcome of each decision problem you face, right? What I was getting at is that your utility function splitting as a sum like this is an assumption about your preferences, not just about the relationship between the various decision problems you face.
Oh, are you talking about the kind of argument that starts from the assumption that your goal is to maximize a sum over time-steps of some function of what you get at that time-step? (This is, in fact, a strong assumption about the nature of the preferences involved, which representation theorems like VNM don’t make.)
If we call this construction F(W,V), then the construction I’m thinking of is F(W,V) ⊗ F(W,W)^∗. Note that F(W,W) is locally 1-dimensional, so my construction is locally isomorphic to yours but globally twisted; it depends on W via more than just its local dimension. Also note that with this definition we will get that Λ^V(V) is always isomorphic to R.
Oh right, I was picturing W being free on connected components when I suggested that. Silly me.
If we pick a basis of R^n then it induces a bijection between Hom(R^n, V) and V×⋯×V. So we could define a map Hom(R^n, V)→U to be ‘alternating’ if and only if the corresponding map V×⋯×V→U is alternating. The interesting thing I noticed about this definition is that it doesn’t depend on which basis you pick for R^n. So I have some hope that, since this construction isn’t basis-dependent, I might be able to write down a basis-independent definition of it.
F is alternating if F(f∘g) = det(g)F(f), right? So if we’re willing to accept kludgy definitions of determinant in the process of defining Λ^W(V), then we’re all set, and if not, then we’ll essentially need another way to define determinant for projective modules, because that’s equivalent to defining an alternating map?
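As a quick numeric sanity check of that condition in the simplest case, take W = V = R² and F = det (just one example of an alternating F, not the general definition); the condition F(f∘g) = det(g)·F(f) is then exactly multiplicativity of the determinant:

```python
def det2(m):
    # determinant of a 2x2 matrix given as nested lists
    (a, b), (c, d) = m
    return a * d - b * c

def compose(f, g):
    # matrix of f∘g for 2x2 matrices
    return [[sum(f[i][k] * g[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

f = [[2, 1], [0, 3]]
g = [[1, 4], [5, 6]]
lhs = det2(compose(f, g))  # F(f∘g) with F = det
rhs = det2(g) * det2(f)    # det(g)·F(f)
```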
In either case, your utility function is meant to be constructed from your underlying preference relation over the set of alternatives for the given problem. The form of the function can be linear in some things or not, that’s something to be determined by your preference relation and not the arguments for EUM.
No, what I was trying to say is that this is true only for representation theorem arguments, but not for the iterated decisions type of argument.
Suppose your utility function is some monotonically increasing function of your eventual wealth. If you’re facing a choice between some set of lotteries over monetary payouts, and you expect to face an extremely large number of i.i.d. iterations of this choice, then by the law of large numbers, you should pick the option with the highest expected monetary value each time, as this maximizes your actual eventual wealth (and thus your actual utility) with probability near 1.
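A minimal simulation of that law-of-large-numbers argument (the lotteries are invented for illustration): lottery A pays 3 with probability 1/2 (expected value 1.5), lottery B pays 1 for certain (expected value 1), and over many i.i.d. repetitions, always picking the higher-EV option ends up with more wealth with probability near 1, despite being riskier on any single step.

```python
import random

def total_payout(pick_risky, steps=10000, seed=0):
    # A: pays 3 with probability 1/2 (EV 1.5); B: pays 1 for sure (EV 1)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        total += (3.0 if rng.random() < 0.5 else 0.0) if pick_risky else 1.0
    return total
```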
Or suppose you expect to face an extremely large number of similarly-distributed opportunities to place bets at some given odds, at whatever stakes you choose on each step, subject to the constraint that you can’t bet more money than you have. Then the Kelly criterion says that if you choose the stakes that maximize your expected log wealth each time, this will maximize your eventual actual wealth (and thus your actual utility, since that’s monotonically increasing with your eventual wealth) with probability near 1.
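A corresponding sketch for the Kelly case (parameters invented): repeated even-money bets won with probability 0.6, where the Kelly fraction is 2·0.6 − 1 = 0.2. Over enough steps, staking that fraction ends up with higher log wealth than either under-betting or over-betting, with probability near 1.

```python
import math
import random

def log_wealths(fracs, p=0.6, steps=10000, seed=0):
    # use the same win/loss sequence for every fraction,
    # so all strategies face identical luck
    rng = random.Random(seed)
    wins = [rng.random() < p for _ in range(steps)]
    return {f: sum(math.log(1 + f) if w else math.log(1 - f) for w in wins)
            for f in fracs}

results = log_wealths([0.05, 0.2, 0.5])  # 0.2 is the Kelly fraction here
```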
So, in the first case, we concluded that you should maximize a linear function of money, and in the second case, we concluded that you should maximize a logarithmic function of money, but in both cases, we assumed nothing about your preferences besides “more money is better”, and the function you’re told to maximize isn’t necessarily your utility function as in the VNM representation theorem. The shape of the function you’re told you should maximize comes from the assumptions behind the iteration, not from your actual preferences.
(Perhaps you were going to address this in a later post, but) the iterated decisions type of argument for EUM and the one-shot arguments like VNM seem not comparable to me in that they don’t actually support the same conclusions. The iterated decision arguments tell you what your utility function should be (linear in amount of good things if future opportunities don’t depend on past results; possibly nonlinear otherwise, as in the Kelly criterion), and the one-shot arguments importantly don’t, instead simply concluding that there should exist some utility function accurately reflecting your preferences.
I’m curious about this. I can see a reasonable way to define Λ^W(V) in terms of sheaves of modules over Spec(R): over each connected component, W has some constant dimension n, so we just let Λ^W(V) be Λ^n(V) over that component. But it sounds like you might not like this definition, and I’d be interested to know if you had a better way of defining Λ^W(V) (which will probably end up being equivalent to this). [Edit: Perhaps something in terms of generators and relations, with the generators being linear maps W→V?]
 Evidence from the response to the Russian invasion of Ukraine suggests that we’d go with option 2 and take major economic damage trying to throw China out of the world trade system while still failing to save Taiwan.
These are disanalogous in that the US has not previously implied an intention to go to war to defend Ukraine, but has (somewhat ambiguously) implied an intention to go to war to defend Taiwan. So the US declining to go to war to defend Ukraine is what you should have expected even if you also believed that the US would go to war to defend Taiwan.