Mathematical Logic graduate student, interested in AI Safety research for ethical reasons.
Martín Soto
Tbh I’m not that passionate, I feel like I’m just doing the necessary minimum to stay completely healthy?
Anyway, of course! But most are just really basic reviews/explainers that are one Google search away, to get the big picture of what you need. Some of these are:
For waaay more stuff, see this vegan cheat sheet
And regardless of these resources, you should of course visit a nutritionist (even if very sporadically, or even just once when you start being vegan) so that they can confirm the important bullet points, check whether what you’re doing broadly works, and tell you when you should worry about anything. (And again, anecdotally this has been strongly stressed and acknowledged as necessary by all the vegans I’ve met, who are not few.)
The nutritionist might recommend yearly (or less frequent) blood tests, which do feel like a good failsafe. I’ve been taking them for ~6 years and all of them have come back perfect (I only supplement B12, as my nutritionist recommended).
I guess it’s not so much that there’s some resource that is the be-all and end-all on vegan nutrition, but more that all of the vegans I’ve met have promoted a really positive, health-conscious attitude, and stressed the importance of these points.
I’ve also very sporadically engaged with more extensive dietary literature, but mainly in Spanish (for example, Lucía Martínez’s Vegetarianos con ciencia).
Nice! I guess I was just worrying about framing, since most people who see this will only skim, and they might get the impression that veganism per se induces deficiencies, instead of un-supplemented veganism (especially since I’ve already seen a comment about fixing this with non-vegan products, instead of the usual and recommended vegan supplementation).
It’s scarce and mostly extremely low quality.
And totally agree. Nonetheless, I do think the available reviews can be called rich in comparison to a 5- or 20-person study.
I don’t doubt your anecdotal experience is as you’re telling it, but mine has been completely different, so much so that it sounds crazy to me to spend a whole year being vegan, and participating in animal advocacy, without hearing mention of B12 supplementation. Literally all vegans I’ve met have very prominently stressed the importance of dietary health and B12 supplementation. Heck, even all the vegan shitposts are about B12!
comparing [~vegans who don’t take supplements] to [omnivores who don’t take supplements] will give the clearest data
Even if that might be literally true for scientific purposes (and stressing again that the above project clearly doesn’t have robust scientific evidence as its goal), I do think this won’t be an accurate representation of the picture when presented to the community, since most vegans do supplement [citation needed, but it’s been my extensive personal and online experience, and all the famous vegan resources I’ve seen stress this], and thus you’re comparing the non-average vegan to the average omnivore, giving a false sense of imbalance against veganism. As rational as we might try to be, framing of this kind matters, and we are all especially emotional and visceral about something as intimate and personal as our diet. On average, people raised as omnivores have a strong repulsion towards veganism (so much so as to override ethical concerns), and I think we should take that into account.
Thanks for this post! I’m guessing the main drive for this project is just compellingly exemplifying that nutrient deficiency is also present in the rationalist community, so that more people treat their dietary health seriously? Even though there’s no a priori reason to expect nutrient deficiency not to happen amongst non-dietary-conscious rationalists.
I say that because (and sorry for maybe being blunt) the sample size is so small compared to the rich existing literature on this topic that this feels more like emotionally compelling “take care of yourself” advice than a scientifically relevant discovery.
On a related note:
The ideal subject was completely vegan, had never put any effort or thought into their diet, and was extremely motivated to take a test and implement changes.
I don’t understand why the target subject here should be people who have never put any effort or thought into their diet. That way you don’t get relevant evidence about the prevalence of iron deficiency among veg*ns, only the almost trivial conclusion that people who don’t take care of their dietary health have some deficiencies.
The Alignment Problems
Parts 2 and 3 read like you independently discovered the lack of performance metrics for decision theories.
Erik has already pointed out some problems in the math, but also:
Formal definition: Same as for the attractor sequence, but for a positive Lyapunov coefficient.
I’m not sure this feels right. For the attractor sequence, it makes sense to think of the last part of the sequence as the attractor, the thing being converged to, and to think of the “structural properties incentivizing attraction” as lying there. On the contrary, it would seem like the “structural properties incentivizing chaos” should be found at the start of the sequence (after which different paths wildly diverge), rather than in any one of those divergent endings. Intuitively, it seems like a sequence should be chaotic just when its Lyapunov exponent is high.
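For reference (this is just the standard dynamical-systems definition, not something taken from the post), the Lyapunov exponent of a one-dimensional discrete map $x_{t+1} = f(x_t)$ is

```latex
% Standard Lyapunov exponent for a 1D discrete map x_{t+1} = f(x_t);
% nearby trajectories separate roughly like e^{\lambda t}.
\lambda \;=\; \lim_{t \to \infty} \frac{1}{t} \sum_{k=0}^{t-1} \ln \lvert f'(x_k) \rvert
```

and “chaotic” standardly corresponds to $\lambda > 0$, i.e. exponential divergence of nearby trajectories.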
On another note, I wonder whether such a conceptualization of language generation as a dynamical system can be fruitful even for natural, non-AI linguistics.
I share this intuition that the solution as stated is underwhelming. But from my perspective that’s just because that key central piece is missing, and this wasn’t adequately communicated in the available public resources about PreDCA (even if it was stressed that it’s a work in progress). I guess this situation doesn’t look as worrisome to Vanessa simply because she has a clearer picture of that central piece, or good reasons to believe it will be achievable, which she hasn’t yet made public. Of course, while this is the case we should treat optimism with suspicion.
Also, let me note that my a priori understanding of the situation is not
let’s suppose amazing theory will solve imperfect search, and then tackle the other inner misalignment directly stemming from our protocol
but more like
given our protocol, we have good mathematical reasons to believe it will be very hard for an inner optimizer to arise without manipulating the hypothesis update. We will use amazing theory to find a concrete learning setup and prove/conjecture that said manipulation is not possible (or that its probability is low). We then hope the remaining inner optimization problems are rare/few/weak enough for other, more straightforward methods to render them highly unlikely (like having the core computing unit explicitly reason about the risk of inner optimization).
Wow, this post is great!
Update: Vanessa addressed this concern.
The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability.
I understand now; that was the main misunderstanding motivating my worries. This and your other two points have driven home for me the role mathematical guarantees play in the protocol, which I wasn’t contemplating. Thanks again for your kind answers!
Hi Vanessa! Thanks again for your previous answers. I’ve got one further concern.
Are all mesa-optimizers really only acausal attackers?
I think mesa-optimizers don’t need to be purely contained in a hypothesis (which would render them acausal attackers), but can be partly made up of the hypothesis-updating procedure itself (maybe this is obvious and you already considered it).
Of course, since the only way to change the AGI’s actions is by changing its hypotheses, even these mesa-optimizers will have to alter hypothesis selection. But their whole running program doesn’t need to be captured inside any single hypothesis (if it were, it would be easier to classify them away as acausal attackers).
That is, if we don’t think about how the AGI updates its hypotheses, and just consider them magically updating (without any intermediate computations), then of course, the only mesa-optimizers will be inside hypotheses. If we actually think about these computations and consider a brute-force search over all hypotheses, then again they will only be found inside hypotheses, since the search algorithm itself is too simple and provides no further room for storing a subagent (even if the mesa-optimizer somehow takes advantage of the details of the search). But if more realistically our AGI employs more complex heuristics to ever-better approximate optimal hypotheses update, mesa-optimizers can be partially or completely encoded in those (put another way, those non-optimal methods can fail / be exploited). This failure could be seen as a capabilities failure (in the trivial sense that it failed to correctly approximate perfect search), but I think it’s better understood as an alignment failure.
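As a toy illustration of the structural point (hypothetical names, nothing resembling PreDCA’s actual machinery): the extra room only appears once the update is an arbitrary procedure rather than an exhaustive evaluation.

```python
# Toy sketch, not PreDCA's algorithm: just showing where extra program space appears.

def brute_force_update(hypotheses, score, observation):
    """Idealized update: exhaustively re-score every hypothesis on the new observation.
    The only 'programs' around are the hypotheses themselves, so any mesa-optimizer
    must live inside some hypothesis."""
    return max(hypotheses, key=lambda h: score(h, observation))

def heuristic_update(current_hypotheses, observation, heuristic):
    """Realistic update: some complex learned procedure proposes the next hypotheses.
    `heuristic` is machinery not contained in any single hypothesis, so a mesa-optimizer
    could be partly or wholly encoded in it (the approximation to perfect search can
    fail or be exploited)."""
    return heuristic(current_hypotheses, observation)
```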
The way I see PreDCA (and this might be where I’m wrong) is as an “outer top-level protocol” which we can fit around any superintelligence of arbitrary architecture. That is, the superintelligence will only have to carry out the hypotheses update (plus some trivial calculations over hypotheses to find the best action), and provided it does that correctly, since the outer objective we’ve provided is clearly aligned, we’re safe. That is, PreDCA is an outer objective that solves outer alignment. But we still need to ensure the hypotheses update is carried out correctly (and that’s everything our AGI is really doing).
I don’t think this realization rules out your Agreement solution, since if truly no hypothesis can steer the resulting actions in undesirable ways (maybe because every hypothesis with a user has the human as the user), then obviously not even optimizers in the hypothesis update can find malign hypotheses (although they can still attack causally, e.g. by hacking the computer they’re running on). But I think your Agreement solution doesn’t completely rule out every undesirable hypothesis; it only makes it harder for an acausal attacker to have the user not be the human. And in this situation, an optimizer in the hypothesis update could still select for malign hypotheses in which the human is subtly incorrectly modelled, in such a precise way that it has relevant consequences for the actions chosen. This can again be seen as a capabilities failure (not modelling the human well enough), but it will always be present to some degree, and it could be exploited by mesa-optimizers.
Given you want to “push the diagonal lemma around / hide it somewhere” and come up with something equivalent but with another shape (I share Nate’s intuitions), something like this paper’s appendix (§12) might be useful: they build the diagonal lemma directly into the Gödel numbering. This might allow for defining your desired proof by a formula and trivially obtaining existence (and your target audience won’t need to know weird stuff happened inside the Gödel numbering). I’ll try to work this out in the near future.
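For readers, the lemma being pushed around is the usual fixed-point statement (standard formulation, not the paper’s particular version): for every formula $\phi(x)$ with one free variable there is a sentence $\psi$ such that

```latex
% Diagonal (fixed-point) lemma, standard form
\mathsf{PA} \vdash \psi \leftrightarrow \phi(\ulcorner \psi \urcorner)
```

and the appeal of building it into the Gödel numbering is that such a $\psi$ can then be written down directly by a formula, rather than obtained through the usual self-referential construction.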
Super cool story I really enjoyed, thank you!
That said, the moral of the story would just be “anthropic measure is just whatever people think anthropic measure is”, right?
Nice post!
I do think there is a way in which this proposal for solving outer alignment “misses a hard bit”, or better put, “presupposes outer alignment is approximately already solved”.
Indeed, the outer objective you propose isn’t completely specified. A key missing part the AGI would need to have internalized is “what human actions constitute evidence of which utility functions”. That is, you are incentivizing it to observe actions in order to reduce the space of possible $U$s, but you haven’t specified how to update on these observations. In your toy example with the switch X, of course we could somehow directly feed the correct update into the AI (which amounts to solving the problem ourselves). But in reality we won’t always be able to do this: all small human actions will constitute some sort of evidence the AI needs to interpret.
So for this protocol to work, we already need the AI to correctly interpret any human action as information about their utility function. That is, we need to have solved IRL, and thus outer alignment.
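To make the gap concrete: updating the candidate utilities on an observed human action $a$ means computing something like

```latex
% Hedged sketch of the missing ingredient: a likelihood model linking actions to utilities
P(U \mid a) \;\propto\; P(a \mid U)\, P(U)
```

and the likelihood term $P(a \mid U)$ (how a human who values $U$ tends to act) is exactly the piece the objective leaves unspecified; pinning it down is the IRL problem.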
But what if we input this under-specified outer objective and let the AI figure out on its own which actions constitute which evidence?
The problem is it’s not enough for the AI to factually know this; we need it to have internalized this as part of its objective, and thus we need the whole objective to be specified from the start. This is analogous to how it’s not enough for an AI to factually know “humans don’t want me to destroy the world”; we actually need its objective to care about this (maybe through a pointer to human values). Of course your proposal tries to construct that pointer; the problem is that if you provide an under-specified pointer, the AI will fill it in (during training, or whatever) with random elements, pointing to random goals. That is, the AI won’t see “performing updates on $U$ in a sensible manner” as part of its goal, even if it eventually learns how to “perform updates on $U$ in a sensible manner” as an instrumental capability (because you didn’t specify that from the start).
But this seems like it’s assuming the AI is already completely capable when we give it an outer objective, which is not the case. What if we train it on this objective (providing the actual score at the end of each episode) in hopes that it learns to fill the objective in the correct way?
Not exactly. The argument above can be rephrased as “if you train it on an under-specified objective, it will almost surely (completely ignoring possible inner alignment failures) learn the wrong proxy for that objective”. That is, it will learn a way to fill in the “update the $U$s” gap which scores perfectly on training but is not what we really would have wanted. So you somehow need to ensure it learns the correct proxy without us completely specifying it, which is again a hard part of the problem. Maybe there’s some practical reason (about the inductive biases of SGD etc.) to expect training with this outer objective to be more tractable than training with any rough approximation of our utility function (maybe because you can somehow vastly reduce the search space), but that’d be a whole different argument.
On another note, I do think this approach might be applicable as an outer shell to “soften the edges” when we already have an approximately correct solution to outer alignment (that is, when we already have a robust account of what constitutes “updating the $U$s”, and also a good enough approximation to our utility function to be confident that our pool of approximate $U$s contains it). In that sense, this seems functionally very reminiscent of Turner’s low-impact measure AUP, which also basically “softens the edges” of an approximate solution by considering auxiliary $U$s. That said, I don’t expect your two approaches to coincide in the limit, since his is basically averaging over some $U$s, and yours is doing something very different, as explained in your “Limit case” section.
Please do let me know if I’ve misrepresented anything :)
Brute-forcing the universe: a non-standard shot at diamond alignment
Oh, so it seems we need a coarse grained user (a vague enough physical realization of the user) for threshold problems to arise. I understand now, thank you again!
I think you’re right, and I wasn’t taking this into account, and I don’t know how Vanessa would respond to it. Her usual stance is that we might expect all mesa-optimizers to be acausal attackers (that is, simulations / false hypotheses), since in this architecture the only way to determine actions is by determining hypotheses (and in fact, she now believes these acausal attackers can all be dealt with in one fell swoop in light of a single theoretical development). But that would seem to ignore the other complex processes going on to update these hypotheses from one time step to the next (as if the updates happened magically and instantaneously, without any further subcomputations). And we don’t even need to employ possibly non-perfect heuristics for these demons to appear: I think they would also appear even if we (in the ideal, infinite-compute scenario) brute-forced the search over all possible hypothesis updates and assessed each one on some metric. In a sense the two appearances of demons are equivalent, but in the latter limit they are more clearly encoded in certain hypotheses (those that game the assessment of hypotheses), while in the former their relationship to hypotheses will be less straightforward, since there will be non-trivial “hypothesis updating” code inside the AI which is not literally equivalent to the hypothesis chosen (and so, parts of this code which aren’t the final chosen hypothesis could also be part of a demon).
I’m not 100% sure the existence of these demons already implies inner misalignment, since they will only be optimized for their continued existence (and this might be achieved by some strategy that, by sheer luck, doesn’t disrupt the outer performance of the AI too much, or at most makes the hypothesis search a bit less efficient). But I think this is just what always happens with mesa-optimizers, and the worry for inner alignment is that any one of them can be arbitrarily disruptive to outer performance (and some disruptive strategies are very efficient for continued existence).
This might be a way in which PreDCA misses a hard bit of Alignment. More concretely, our problem is basically that the search space of possible AGI designs is too vast, and our search ability too limited. PreDCA tries to reduce this space by considering a very concrete protocol which can be guaranteed to behave in certain ways. But maybe all (or most) of the vastness of the search space has been preserved, only now disguised as the search space over possible inner heuristics that can implement said protocol. Put another way, whether the model implements simplifying heuristics or carries out a brute-force search, the space of possible hypothesis updates remains (approximately) as vast and problematic. Implementing heuristics approximately preserves this vastness: even if, once the heuristic is implemented, the search is considerably smaller, before that we already had to search over possible heuristics.
In fact, generalizing such arguments could be a piece in an argument that “abstracted perfect Alignment”, in the sense of “a water-tight solution that aligns agents of arbitrary capability (arbitrarily close-to-perfect consequentialists) with arbitrary goals”, is unsolvable. That is, if we abstract away all the contextual contingencies that can make (even the strongest) AIs imperfect consequentialists, then (almost “by definition”) they will always outplay our schemes (because the search space is being conceptualized as unboundedly vast).
What prior over policies?
Some kind of simplicity prior, as mentioned here.
Suppose the prior over policies is max-entropy (uniform over all action sequences). If the number of “actions” is greater than the number of bits it takes to specify my brain[1], it seems like it would conclude that my utility function is something like “1 if {acts exactly like [insert exact copy of my brain] would}, else 0”.
Yes. In fact I’m not even sure we need your assumption about bits. Say policies are sequences of actions, and suppose at each time step we have $n$ actions available. Then, in our process of approximating your perfect/overfitted utility “1 if {acts exactly like [insert exact copy of my brain] would}, else 0”, adding one more specified action to our $U$ can be understood as adding one more symbol to its generating program, and so incrementing $K(U)$ by 1. But also, adding one more (perfectly) specified action multiplies the denominator probability by $1/n$ (since the prior is uniform). So as long as $n > 2$, the score will be unbounded as we approximate your utility.
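Spelling that out (a hedged sketch: I’m assuming the relevant score has roughly the form $2^{-K(U)}$ divided by the prior probability of the policies doing at least as well on $U$ as $\pi^*$), with $U_k$ the utility that exactly specifies the first $k$ actions:

```latex
% Per extra specified action, under a uniform prior over the n actions at each step:
\frac{\mathrm{score}(U_{k+1})}{\mathrm{score}(U_k)}
  \;=\; \frac{2^{-(K(U_k)+1)}}{2^{-K(U_k)}} \cdot \frac{n^{-k}}{n^{-(k+1)}}
  \;=\; \frac{n}{2}
```

which exceeds 1 whenever $n > 2$, so the overfitted utility wins in the limit.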
And of course, this is solved by the simplicity prior, because it makes it easier for simple $U$s to achieve low denominator probability. So a way simpler $U$ (less overfitted to $\pi^*$) will achieve almost the same low denominator probability as your function, because the only policies that maximize $U$ better than $\pi^*$ are too complex.
That’s a pity to hear, since undoubtedly any dietary change or improvement does require some thought (and we should also think about our dietary health regardless). That said, I do generally feel the required effort, thought, and money are way less than mainstream opinion usually pictures.
Regarding effort and thought, my experience (and that of almost all vegans I’ve known who didn’t suffer from already-present, unrelated health issues) was that the change did require some effort and thought during the first months (getting used to cooking ~15 nutritious vegan comfort meals, checking labels, possibly visiting your nutritionist for the first time), but it quickly became a habit, as customary as following an omnivore diet.
And regarding money, the big bulk of a healthy vegan diet isn’t meat substitutes, processed burgers, etc., but vegetables, legumes, cereals… some of the cheapest products you can buy, especially when compared to meat. So a healthy vegan diet is indeed usually cheaper than an omnivore diet, even including the B12 supplementation, which is really cheap, and maybe even including the sporadic nutritionist visit and blood test (in case they’re not already covered by your public health system or medical plan), unless doctor visits are absurdly expensive in your country.