# Vanessa Kosoy

Karma: 6,594

AI alignment researcher supported by HUJI, MIRI and LTFF. Working on the learning-theoretic agenda.

E-mail: vanessa DOT kosoy AT {the thing reverse stupidity is not} DOT org

• 7 Jan 2023 12:59 UTC
4 points
0 ∶ 0

The syntax $L^{\otimes q}$ means “$L$ to the tensor power of $q$”. For $q > 0$, it just means tensoring $L$ with itself $q$ times. For $q = 0$, $L^{\otimes 0}$ is just the trivial line bundle over the base $Y$ of $L$, with total space $Y \times \mathbb{C}$ (and, yes, all line bundles are isomorphic to the trivial line bundle, but this one just is the trivial bundle… or at least, canonically isomorphic to it). For $q < 0$, we need the notion of a dual vector bundle. Any vector bundle $V$ has a dual $V^*$, and for a line bundle $L$ the dual is also the inverse, in the sense that $L \otimes L^*$ is canonically isomorphic to the trivial bundle. We can then define all negative powers by $L^{\otimes (-q)} := (L^*)^{\otimes q}$. Notice that non-negative tensor powers are defined for all vector bundles, but negative tensor powers only make sense for line bundles.

It remains to explain what the dual $V^*$ is. But, for our purposes we can take a shortcut. The idea is, for any finite-dimensional complex vector space $W$ with an inner product, there is a canonical isomorphism between $W^*$ and $\bar{W}$, where $\bar{W}$ is the complex-conjugate space. What is the complex-conjugate space? It is a vector space that (i) has the same set of vectors (ii) has the same addition operation and (iii) has its multiplication-by-scalar operation modified, so that multiplying by $z$ in $\bar{W}$ is the same thing as multiplying by $\bar{z}$ in $W$, where $\bar{z}$ is just the complex number conjugate to $z$.

Equipped with this observation, we can define the dual $L^*$ of a Hermitian line bundle $L$ to be $\bar{L}$, where $\bar{L}$ is the bundle obtained from $L$ by changing its multiplication-by-scalar mapping in the obvious way.
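The complex-conjugate trick can be checked numerically in the simplest case. The following is only a sketch under stated assumptions: the vector space is 1-dimensional and modelled by Python complex numbers, the inner product is taken to be conjugate-linear in the first slot, and the helper names (`inner`, `to_dual`, `conj_scale`) are made up for illustration.

```python
# Sketch: a 1-dimensional complex inner product space, modelled by complex numbers.
# Assumed convention: <u, v> = conj(u) * v (conjugate-linear in the first slot).

def inner(u: complex, v: complex) -> complex:
    return u.conjugate() * v

def to_dual(v: complex):
    """Send a vector v to the linear functional w -> <v, w>."""
    return lambda w: inner(v, w)

def conj_scale(z: complex, v: complex) -> complex:
    """Scalar multiplication in the complex-conjugate space: z acts as conj(z)."""
    return z.conjugate() * v

z, v, w = 2 + 1j, 1 - 3j, 4 + 2j

# With the ordinary scalar action, v -> <v, .> is only conjugate-linear ...
conjugate_linear = to_dual(z * v)(w) == z.conjugate() * to_dual(v)(w)
# ... but with the conjugate-space scalar action it is linear.
linear_on_conjugate_space = to_dual(conj_scale(z, v))(w) == z * to_dual(v)(w)

print(conjugate_linear, linear_on_conjugate_space)
```

The small integer-valued inputs keep the floating-point arithmetic exact, so plain equality is safe here; for general inputs a tolerance would be needed.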

• IMO it might very well be that most restrictions on data and compute are net positive. However, there are arguments in both directions.

On my model, current AI algorithms are missing some key ingredients for AGI, but they might still eventually produce AGI by learning those missing ingredients. This is similar to how biological evolution is a learning algorithm which is not a GI, yet it produced humans who are GIs. Such an AGI would be a mesa-optimizer, and it’s liable to be unaligned regardless of the details of the outer loop (assuming an outer loop made of building blocks similar to what we have today). For example, the outer loop might be aimed at human imitation, but the resulting mesa-optimizer is only imitating humans when it’s instrumentally beneficial for it. Moreover, as in the case of evolution, this process would probably be very costly in terms of compute and data, as it is trying to “brute force” a problem for which it doesn’t have an efficient algorithm. Therefore, limiting compute or data seems like a promising way to prevent this undesirable scenario.

On the other hand, the most likely path to aligned AI would be through a design that’s based on solid theoretical principles. Will such a design require much data or compute compared to unaligned competitors?

Reasons to think it won’t:

• Solid theoretical principles should allow improving capabilities as well as alignment.

• Intuitively, if an AI is capable enough to be transformative (given access to particular amounts of compute and data), it should be capable enough to figure out human values, assuming it is motivated to do so in the first place. Or, it should at least be capable enough to act against unaligned competition while not irreversibly destroying information about human values (in which case it can catch up on learning those later). This is similar to what Christiano calls “strategy stealing”.

Reasons to think it will:

• Maybe aligning AI requires installing safeguards that cause substantial overhead. This seems very plausible when looking at proposals such as Delegative Reinforcement Learning, which has worse regret asymptotics than “unaligned” alternatives (conventional RL). It also seems plausible when looking at proposals such as IDA or debate, which introduce another level of indirection (simulating humans) to the problem of optimizing the world that unaligned AI attacks directly (in Christiano’s terminology, they fail to exploit inaccessible information). It’s less clear about PreDCA, but even there alignment requires a loss function with a more complex type signature than the infra-Bayesian physicalism “default”, which might incur a statistical or computational penalty.

• Maybe aligning AI requires restricting ourselves to using well-understood algorithmic building blocks and not heuristic (but possibly more efficient) building blocks. Optimistically, having solid theoretic principles should allow us to roughly predict the behavior even of heuristic algorithms that are effective (because such algorithms have to be doing qualitatively the same thing as the rigorous algorithms). Pessimistically, alignment might depend on nuances that are obscured in heuristics.

We can model the situation by imagining 3 frontiers in resource space:

• The mesa-resource-frontier (MRF) is how much resources are needed to create TAI with something similar to modern algorithms, i.e. while still missing key AGI ingredients (which is necessarily unaligned).

• The direct-resource-frontier (DRF) is how much resources are needed to create TAI assuming all key algorithms, but without any attempt at alignment.

• The aligned-resource-frontier (ARF) is how much resources are needed to create aligned TAI.

We have ARF > DRF and MRF > DRF, but the relation between ARF and MRF is not clear. They might even intersect (resource space is multidimensional, we at least have data vs compute and maybe finer distinctions are important). I would still guess MRF > ARF, by and large. Assuming MRF > ARF > DRF, the ideal policy would forbid resources beyond MRF but allow resources beyond ARF. A policy that is too lax might lead to doom by the mesa-optimizer pathway. A policy that is too strict might lead to doom by making alignment infeasible. If the policy is so strict that it forces us below DRF then it buys time (which is good), but if the restrictions are then lifted gradually, it predictably leads to the region between DRF and ARF (which is bad).
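The case analysis above can be made concrete with a toy classifier. This is only a sketch under strong assumptions: resources are collapsed to a single scalar (the comment itself stresses that resource space is multidimensional), the ordering MRF > ARF > DRF is taken as given, and the function name, labels and boundary conventions are hypothetical.

```python
def classify_cap(cap: float, drf: float, arf: float, mrf: float) -> str:
    """Classify a resource cap, assuming one scalar resource and DRF < ARF < MRF."""
    if not (drf < arf < mrf):
        raise ValueError("this sketch assumes DRF < ARF < MRF")
    if cap >= mrf:
        # Too lax: modern-style algorithms can reach TAI via a mesa-optimizer.
        return "too lax: mesa-optimizer pathway open"
    if cap >= arf:
        # Enough resources for aligned TAI, not enough for the mesa pathway.
        return "ideal: aligned TAI feasible, mesa pathway closed"
    if cap >= drf:
        # Direct unaligned TAI is feasible, aligned TAI is not.
        return "too strict: between DRF and ARF"
    # No TAI is feasible at all, which buys time while the cap holds.
    return "buys time: below DRF"

# Hypothetical frontier values, in arbitrary units.
for cap in (0.5, 1.5, 2.5, 3.5):
    print(cap, classify_cap(cap, drf=1.0, arf=2.0, mrf=3.0))
```

Gradually lifting a cap that starts below DRF corresponds to walking through the outputs from bottom to top, which passes through the "between DRF and ARF" region before reaching the ideal one.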

Overall, the conclusion is uncertain.

• 24 Dec 2022 18:27 UTC
4 points
0 ∶ 0

There are two operations involved in the definition of $E$: pullback and tensor product.

Pullback is defined for arbitrary bundles. Given a mapping $f: X \to Y$ (these $X$ and $Y$ are arbitrary manifolds, not the specific ones from before) and a bundle $B$ over $Y$ with total space $\mathrm{Tot}(B)$ and projection mapping $\pi: \mathrm{Tot}(B) \to Y$, the pullback of $B$ w.r.t. $f$ (denoted $f^*B$) is the bundle over $X$ with total space $X \times_Y \mathrm{Tot}(B)$ and the obvious projection mapping. Recall that $X \times_Y \mathrm{Tot}(B)$ is the fibre product, i.e. the submanifold of $X \times \mathrm{Tot}(B)$ defined by $f(x) = \pi(b)$. Notice that the fibre of $f^*B$ over any $x \in X$ is canonically isomorphic to the fibre of $B$ over $f(x)$. The word “canonical” means that there is a particular isomorphism that we obtain from the construction.

It is easy enough to see that the pullback of a vector bundle is a vector bundle, the pullback of a line bundle is a line bundle, and the pullback of a Hermitian vector bundle is a Hermitian vector bundle.

Tensor product is an operation over vector bundles. There are different ways to define it, corresponding to the different ways to define a tensor product of vector spaces. Specifically for line bundles there is the following shortcut definition. Let $L_1$ and $L_2$ be line bundles over $X$. Then, the total space of $L_1 \otimes L_2$ is the quotient of $L_1 \times_X L_2$ by the equivalence relation given by: $(v_1, v_2) \sim (w_1, w_2)$ iff $v_1 \otimes v_2 = w_1 \otimes w_2$. Here, I regard $v_1, w_1$ as vectors in the vector space which is the corresponding fibre of $L_1$ and similarly for $v_2, w_2$ and $L_2$. The quotient of a manifold by an equivalence relation is not always a manifold, but in this case it is.

I notice that you wrote “a particular fiber is isomorphic to $\mathbb{C}$”. Your error here is, it doesn’t matter what it’s isomorphic to, you should still think of it as an abstract vector space. So, if e.g. $V$ and $W$ are 1-dimensional vector spaces, then $V \otimes W$ is yet another “new” vector space. Yes, they are all isomorphic, but they are not canonically isomorphic.

• 22 Dec 2022 12:00 UTC
6 points
0 ∶ 0

Your guess is exactly what I meant. The $\psi$ is outside the product, otherwise this expression is not even a valid group action.

As you said, a bundle over a manifold $M$ is another manifold $B$ with a projection $\pi: B \to M$ s.t. locally it looks like a product. Formally, every $x \in M$ should have an open neighborhood $U$ s.t. there is a diffeomorphism between $\pi$ restricted to $\pi^{-1}(U)$ and a projection $U \times F \to U$ for some manifold $F$ (the “fiber”).

A vector bundle is a bundle equipped with additional structure that makes every fiber a vector space. Formally, we need to have a smooth addition mapping $+: B \times_M B \to B$ and a multiplication-by-scalar mapping $\cdot: \mathbb{C} \times B \to B$ which are (i) morphisms of bundles and (ii) make every fiber (i.e. the inverse $\pi$-image of every point in $M$) into a vector space. Here, $\times_M$ stands for the fibre product (the submanifold of $B \times B$ given by $\pi(b_1) = \pi(b_2)$). I’m using $\mathbb{C}$ here because we will need complex vector bundles.

A line bundle is just a vector bundle s.t. every fiber is 1-dimensional.

A Hermitian vector bundle is a vector bundle equipped with a smooth mapping of bundles $\langle \cdot, \cdot \rangle: B \times_M B \to M \times \mathbb{C}$ which makes every fibre into an inner product space.

Onward to quantum mechanics. Let $X$ be physical space and $Y$ physical spacetime. In the non-relativistic setting, $Y$ is isomorphic to $\mathbb{R}^4$, so all Hermitian line bundles over $Y$ are isomorphic. So, in principle any one of them can be identified with the trivial bundle: total space $Y \times \mathbb{C}$ with $\pi$ being the canonical projection. However, it is often better to imagine some Hermitian line bundle without such an identification. In fact, choosing an identification precisely corresponds to choosing a gauge. This is like how all finite dimensional real vector spaces are isomorphic to $\mathbb{R}^n$ but it is often better not to fix a particular isomorphism (basis), because that obscures the underlying symmetry group of the problem. For finite dimensional vector spaces, the symmetry group is the automorphisms of the vector space (a group isomorphic to $GL(n, \mathbb{R})$), for bundles it is the automorphism group of the bundle (= the group of gauge transformations).

So, let’s fix a Hermitian line bundle $L$ on $Y$. This allows constructing a Hermitian line bundle $E$ on $X^n \times \mathbb{R}$ (where $n$ is the number of particles) using the equation I gave before. That equation involves the operations of tensor product and pullback-by-mapping for bundles. I can explain, but maybe you can guess how they are defined (just imagine what it should do to every fibre, and then there is only one reasonable way to “glue” it together). If we fix an isomorphism between $L$ and the trivial bundle over $Y$ (=gauge) then it induces an isomorphism between $E$ and the trivial bundle over $X^n \times \mathbb{R}$. In this picture, saying that $\psi$ is a section of $E$ amounts to saying it is a mapping $\psi: X^n \times \mathbb{R} \to X^n \times \mathbb{R} \times \mathbb{C}$ which is compatible with the projection. The latter condition just means it is the identity on the $X^n \times \mathbb{R}$ component of the output, so all the information is in the $\mathbb{C}$ component of the output, reducing it to a mapping $X^n \times \mathbb{R} \to \mathbb{C}$.

This way, in every particular gauge the wavefunction is just a complex function, but there is a sense in which it is better to avoid fixing a gauge and think of the wavefunction as a section of the somewhat abstract bundle $E$. Just like a vector in a finite dimensional vector space can be thought of as a column of numbers, but often it’s better to think of it as just an abstract vector.

• 21 Dec 2022 8:39 UTC
5 points
0 ∶ 0

You don’t need QFT here, gauge invariance is a thing even for non-relativistic quantum charged particles moving in a background electromagnetic field. The gauge transformation group consists of (sufficiently regular) functions $g: \mathbb{R}^4 \to U(1)$. The transformation law of the $n$-particle wavefunction is:

$$\psi'(t, x_1, \ldots, x_n) = \left(\prod_{i=1}^{n} g(t, x_i)^{q_i}\right) \psi(t, x_1, \ldots, x_n)$$

Here, $q_i$ is the electric charge of the $i$-th particle, in units of positron charge.

In math-jargony terms, the wavefunction is a section of the line bundle

$$E := \bigotimes_{i=1}^{n} \pi_i^* L^{\otimes q_i}$$

Here, $\pi_i: \mathbb{R}^{3n+1} \to \mathbb{R}^4$ is the projection to the position of the $i$-th particle and $L$ is the “standard” line bundle on $\mathbb{R}^4$ on which the electromagnetic field (the 4-potential $A$, which is useful here even though the setting is non-relativistic) is a connection. $E$ has an induced connection, and the electromagnetic time-dependent Schrödinger equation is obtained from the ordinary time-dependent Schrödinger equation by replacing ordinary derivatives with covariant derivatives.
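As a sanity check, one can verify numerically that multiplying the wavefunction by one $U(1)$ phase per particle, each raised to that particle's charge, composes like a group action. The sketch below makes several simplifying assumptions that are not in the comment: positions are 1-dimensional, time is suppressed, and the toy wavefunction, charges and helper names (`gauge`, `transform`) are hypothetical.

```python
import cmath

def gauge(theta):
    """Build a U(1)-valued gauge function x -> exp(i * theta(x))."""
    return lambda x: cmath.exp(1j * theta(x))

def transform(g, psi, charges):
    """Return psi' with psi'(xs) = (prod_i g(x_i) ** q_i) * psi(xs)."""
    def psi_prime(xs):
        phase = 1 + 0j
        for x, q in zip(xs, charges):
            phase *= g(x) ** q
        return phase * psi(xs)  # psi is multiplied once, outside the product
    return psi_prime

charges = [1, -1, 2]                    # hypothetical integer charges
psi = lambda xs: complex(sum(xs), 1.0)  # arbitrary toy wavefunction
g = gauge(lambda x: 0.3 * x)
h = gauge(lambda x: x * x)
gh = lambda x: g(x) * h(x)              # pointwise product of gauge functions

xs = [0.5, -1.0, 2.0]
two_steps = transform(h, transform(g, psi, charges), charges)(xs)
one_step = transform(gh, psi, charges)(xs)
print(abs(two_steps - one_step))        # agrees up to floating-point error
```

Acting with `g` and then `h` matches acting once with their pointwise product, which is the group-action property. The modulus of the wavefunction is unchanged, since only a phase is applied.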

• Other people have noted that Solomonoff log-probability differs from Kolmogorov complexity only by a constant. But there’s another similar pair of objects I’m interested in, where I don’t know whether the analogous claim holds. Namely, in my original definition of the AIT intelligence measure, I used Kolmogorov complexity, because I implicitly assumed it’s the same as Solomonoff log-probability up to a constant. But Alex questioned this claim, which is why I switched to Solomonoff log-probability when writing about the physicalist version (see Definition 1.6 here). The crucial difference between this and the question in the OP is, we’re looking at programs selected by Solomonoff-expectation of something-to-do-with-their-output, rather than directly by their output (which places us on different spots on the computability ladder). If the two are different then I’m pretty sure Solomonoff log-probability is the correct one, but are they? I would be very interested to know.

• Personally, I feel that I want to be pretty as a goal in itself and in order to be attractive to straight men (and to other gynosexual people). I suspect women[1] have an evolved intrinsic desire to look pretty because in the ancestral environment it increased your genetic fitness to look pretty. To give an analogy, we eat both because (i) we are hungry and also food tastes good, and because (ii) we know we need to eat to survive (e.g. if you’re sick and have no appetite you sometimes force yourself to eat) and we need particular types of food to stay healthy. A single activity can be motivated by a mixture of terminal and instrumental goals.

1. ↩︎

And other genders too, but for women it’s more pronounced, on average.

• The post is still largely up-to-date. In the intervening year, I mostly worked on the theory of regret bounds for infra-Bayesian bandits, and haven’t made much progress on open problems in infra-Bayesian physicalism. On the other hand, I also haven’t found any new problems with the framework.

The strongest objection to this formalism is the apparent contradiction between the monotonicity principle and the sort of preferences humans have. While my thinking about this problem evolved a little, I am still at a spot where every solution I know requires biting a strange philosophical bullet. On the other hand, IBP is still my best guess about naturalized induction, and, more generally, about the conjectured “attractor submanifold” in the space of minds, i.e. the type of mind to which all sufficiently advanced minds eventually converge.

One important development that did happen is my invention of the PreDCA alignment protocol, which critically depends on IBP. I consider PreDCA to be the most promising direction I know at present to solving alignment, and an important (informal) demonstration of the potential of the IBP formalism.

• It should be . More generally, there is the notion of support from measure theory, which sometimes comes up, although in this post we only work with finite sets so it’s the same.

• First, the notation makes no sense. The prior is over hypotheses, each of which is an element of . is the notation used to denote a single hypothesis.

Second, having a prior just over doesn’t work since both the loss function and the counterfactuals depend on .

Third, the reason we don’t just start with a prior over , is because it’s important which prior we have. Arguably, the correct prior is the image of a simplicity prior over physicalist hypotheses by the bridge transform. But, come to think about it, it might be about the same as having a simplicity prior over , where each hypothesis is constrained to be invariant under the bridge transform (thanks to Proposition 2.8). So, maybe we can reformulate the framework to get rid of (but not of the bridge transform). Then again, finding the “ultimate prior” for general intelligence is a big open problem, and maybe in the end we will need to specify it with the help of .

Fourth, I wouldn’t say that is supposed to solve the ontology identification problem. The way IBP solves the ontology identification problem is by asserting that is the correct ontology. And then there are tricks how to translate between other ontologies and this ontology (which is what section 3 is about).

• 6 Dec 2022 9:06 UTC
LW: 36 AF: 20
0 ∶ 0
AF

$P \ne NP$ deserves a little more credit than you give it. To interpret the claim correctly, we need to notice $P$ and $NP$ are classes of decision problems, not classes of proof systems for decision problems. You demonstrate that for a fixed proof system it is possible that generating proofs is easier than verifying proofs. However, if we fix a decision problem and allow any valid (i.e. sound and complete) proof system, then verifying cannot be harder than generating. Indeed, let $S$ be some proof system and $A$ an algorithm for generating proofs (i.e. an algorithm that finds a proof if a proof exists and outputs “nope” otherwise). Then, we can construct another proof system $S'$, in which a “proof” is just the empty string and “verifying” a proof for problem instance $x$ consists of running $A(x)$ and outputting “yes” if it found an $S$-proof and “no” otherwise. Hence, verification in $S'$ is no harder than generation in $S$. Now, so far it’s just $P \subseteq NP$, which is trivial. The non-trivial part is: there exist problems for which verification is tractable (in some proof system) while generation is intractable (in any proof system). Arguably there are even many such problems (an informal claim).
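The construction of the second proof system from a proof generator is mechanical, and can be sketched in a few lines. The decision problem used here (is $n$ composite?, with a nontrivial divisor as the proof) and all function names are illustrative assumptions, not part of the original argument.

```python
import math
from typing import Callable, Optional

# A generator returns a proof in the original system, or None for "nope".
Generator = Callable[[int], Optional[str]]

def make_trivial_verifier(generate: Generator) -> Callable[[int, str], bool]:
    """Build the verifier of the derived system, where the only proof is ''."""
    def verify(instance: int, proof: str) -> bool:
        if proof != "":
            return False
        # "Verifying" just reruns the generator of the original system.
        return generate(instance) is not None
    return verify

def find_divisor(n: int) -> Optional[str]:
    """Toy generator: search for a nontrivial divisor of n by trial division."""
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return str(d)
    return None

verify = make_trivial_verifier(find_divisor)
print(verify(91, ""), verify(97, ""))  # prints: True False
```

The derived verifier costs exactly one run of the generator, which is the sense in which verification in the new system is no harder than generation in the old one.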

• First, no, the AGI is not going to “employ complex heuristics to ever-better approximate optimal hypotheses update”. The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability. Just like we can prove that e.g. SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).

Second, there’s the issue of non-cartesian attacks (“hacking the computer”). Assuming that the core computing unit is not powerful enough to mount a non-cartesian attack on its own, such attacks can arguably be regarded as detrimental side-effects of running computations on the envelope. My hope is that we can shape the prior about such side-effects in some informed way (e.g. the vast majority of programs won’t hack the computer) s.t. we still have approximate learnability (i.e. the system is not too afraid to run computations) without misspecification (i.e. the system is not overconfident about the safety of running computations). The more effort we put into hardening the system, the easier it should be to find such a sweet spot.

Third, I hope that the agreement solution will completely rule out any undesirable hypothesis, because we will have an actual theorem that guarantees it. What the exact assumptions are going to be, and what needs to be done to make sure these assumptions hold, is work for the future, ofc.

• 26 Nov 2022 12:02 UTC
LW: 6 AF: 4
0 ∶ 0
AF

I don’t think the argument on hacking relied on the ability to formally verify systems. Formally verified systems could potentially skew the balance of power to the defender side, but even if they don’t exist, I don’t think balance is completely skewed to the attacker.

My point was not about the defender/​attacker balance. My point was that even short-term goals can be difficult to specify, which undermines the notion that we can easily empower ourselves by short-term AI.

Of course we need to understand how to define “long term” and “short term” here. One way to think about this is the following: we can define various short-term metrics, which are evaluable using information in the short-term, and potentially correlated with long-term success. We would say that a strategy is purely long-term if it cannot be explained by making advances on any combination of these metrics.

Sort of. The correct way to make it more rigorous, IMO, is using tools from algorithmic information theory, like I suggested here.

• In order to appear as a side-comment, quotes should be an exact match, including formatting

This might be inconvenient for markdown editor users. Because, when you copy text into the markdown editor, it loses the formatting. It would be nice if either formatting was ignored for side-comment matching purposes, or if copying formatted text into markdown would automatically add the relevant tags (the latter would have other benefits as well).

• 23 Nov 2022 17:59 UTC
LW: 7 AF: 4
0 ∶ 0
AF

Thanks for the responses Boaz!

Our claim is that one can separate out components—there is the predictable component which is non stationary, but is best approximated with a relatively simple baseline, and the chaotic component, which over the long run is just noise. In general, highly complex rules are more sensitive to noise (in fact, there are theorems along these lines in the field of Analysis of Boolean Functions), and so in the long run, the simpler component will dominate the accuracy.

I will look into analysis of boolean functions, thank you. However, unless you want to make your claim more rigorous, it seems suspect to me.

In reality, there are processes happening simultaneously on many different timescales, from the microscopic to the cosmological. And, these processes are coupled, so that the current equilibrium of each process can be regarded as a control signal for the higher timescale processes. This means we can do long-term planning by starting from the long timescales and back-chaining to short timescales, like I began to formalize here.

So, while eventually the entire universe reaches an equilibrium state (a.k.a. heat-death), there is plenty of room for long-term planning before that.

Hacking is actually a fairly well-specified endeavor. People catalog, score, and classify security vulnerabilities. To hack would be to come up with a security vulnerability, and exploit code, which can be verified.

Yeeees, it does seem like hacking is an especially bad example. But even in this example, my position is quite defensible. Yes, theoretically you can formally specify the desired behavior of the code and verify that it always happens. But, there are two problems with that: First, for many realistic software systems, the formal specification would require colossal effort. Second, the formal verification is only as good as the formal model. For example, if the attacker found a hardware exploit, while your model assumes idealized behavior for the hardware, the verification doesn’t help. And, in domains outside software the situation is much worse: how do you “verify” that your biological security measures are fool-proof, for example?

Also, you seem to be envisioning a long-term AI that is then fine-tuned on a short-term task, but how did it evolve these long-term goals in the first place?

When you’re selecting for success on a short-term goal you might inadvertently produce a long-term agent (which, on the training distribution, is viewing the short-term goal as instrumental for its own goals), just like how evolution was selecting for genetic fitness but ended up producing agents with many preferences unrelated to that. More speculatively, there might be systematic reasons for such agents to arise, for example if good performance in the real-world requires physicalist epistemology which comes with inherent “long-terminess”.

I would not say that there is no such thing as talent in being a CEO or presidents. I do however believe that the best leaders have been some combination of their particular characteristics and talents, and the situation they were in. Steve Jobs has led Apple to become the largest company in the world, but it is not clear that he is a “universal CEO” that would have done as good in any company (indeed he failed with NeXT).

This sounds like a story you can tell about anything. “Yes, such-and-such mathematician proved a really brilliant theorem A, but their effort to make progress in B didn’t amount to much.” Obviously, real-world performance depends on circumstances and not only on talent. This is doubly true in a competitive setting, where other similarly talented people are working against you. Nevertheless, a sufficiently large gap in talent can produce very lopsided outcomes.

Also, as Yafah points elsewhere here, for people to actually trust an AI with being the leader of a company or a country, it would need to not just be as good as humans or a little better, but better by a huge margin. In fact, most people’s initial suspicion is that AIs (or even humans that don’t look like them) is not “aligned” with their interests, and if you don’t convince them otherwise, their default would be to keep them from positions of power.

First, it is entirely possible the AI will be better by a huge margin, because like with most things, there’s no reason to believe evolution brought us anywhere near the theoretical optimum on this. (Yes, there was selective pressure, but no amount of selective pressure allowed evolution to invent spaceships, or nuclear reactors, or even the wheel.) Second, what if the AI poses as a human? Or, what if the AI uses a human as a front while pulling the strings behind the scenes? There will be no lack of volunteers to work as such a front, if in the short term it brings them wealth and status. Also, ironically, the more successful AI risk skeptics are at swaying public opinion, the easier the AI’s job is and the weaker their argument becomes.

The main point is that we need to measure the powers of a system as a whole, not compare the powers of an individual human with an individual AI. Clearly, if you took a human, made their memory capacity 10 times bigger, and made their speed 10 times faster, then they could do more things. But we are comparing with the case that humans will be assisted with short-term AIs that would help them in all of the tasks that are memory and speed intensive.

Alright, I can see how the “universality” argument makes sense if you believe that “human + short-term AI = scaled-up human”. The part I doubt is that this equation holds for any easy-to-specify value of “short-term AI”.

• IIUC the thesis of this article rests on several interrelated claims:

1. Long-term planning is not useful because of chaos

2. Short-term AIs have no alignment problem

3. Among humans, skill is not important for leadership, beyond some point

4. Human brains have an advantage w.r.t. animals because of “universality”, and any further advantage can only come from scaling with resources.

I wish to address these claims one by one.

## Claim 1

This is an erroneous application of chaos theory IMO. The core observation of chaos theory is, that in many dynamical systems with compact phase space, any distribution converges (in the Kantorovich-Rubinstein sense) to a unique stationary distribution. This means that small measurement errors lead to large prediction errors, and in the limit no information from the initial condition remains.

However, real-world dynamical systems are often not compact in the relevant approximation. In particular, acquisition of resources and development of new technologies are not bounded from above on a relevant scale. Indeed, trends in GDP growth and technological progress continue over long time scales and haven’t converged, so far, to a stationary distribution. Ultimately, these quantities are also bounded for physical /​ information-theoretic /​ complexity-theoretic reasons, but since humanity is pretty far from saturating them, this leaves ample room for AI to have a long-term planning advantage over humanity.
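The sensitivity to initial conditions that the chaos argument relies on is easy to see in a standard toy system. The sketch below uses the logistic map with parameter 4 on the compact phase space [0, 1]; the choice of map, starting point and perturbation size are assumptions made for illustration, not something taken from the article under discussion.

```python
def logistic_orbit(x0: float, steps: int, r: float = 4.0) -> list:
    """Iterate the logistic map x -> r * x * (1 - x), returning the whole orbit."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two initial conditions that differ by a tiny "measurement error".
a = logistic_orbit(0.2, 60)
b = logistic_orbit(0.2 + 1e-10, 60)

gaps = [abs(x - y) for x, y in zip(a, b)]
print(gaps[0], max(gaps))
```

The gap starts around 1e-10 and grows to the same order as the size of the phase space within a few dozen steps, after which the initial condition carries essentially no predictive information. This behavior depends on the phase space being bounded, which is exactly the assumption questioned above for quantities like resources and technology.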

## Claim 2

Although it is true that, for sufficiently short-term planning horizons, AIs have less incentives to produce unintended consequences, problems remain.

One problem is that some tasks are very difficult to specify. For example, suppose that a group of humans armed with short-term AIs is engaged in cyberwarfare against a long-term AI. Then, even if every important step in the conflict can be modeled as short-term optimization, specifying the correct short-term goal can be a non-trivial task (how do you define “to hack” or “to prevent from hacking”?) that humans can’t easily point their short-term AI towards.

Moreover, AIs trained on short-term objectives can still display long-term optimization out-of-distribution. This is because a long-term optimizer that is smart enough to distinguish between training and deployment can behave according to expectations during training while violating them as much as it wants when it’s either outside of training or the correcting outer loop is too slow to matter.

## Claim 3

This claim flies so much in the face of common sense (is there no such thing as business acumen? charisma? military genius?) that it needs a lot more supporting evidence IMO. The mere fact that IQs of e.g. CEOs are only moderately above average and not far above average only means that IQ stops being a useful metric at that range, since beyond some point, different people have cognitive advantages in different domains. I think that, as scientists, we need to be careful not to cavalierly dismiss the sort of skills we don’t have.

As to the skepticism of the authors about social manipulation, I think that anyone who studied history or politics can attest that social manipulation has been used, and continues to be used, with enormous effects. (Btw, I think it’s probably not that hard to separate a dog from a bone or child from a toy if you’re willing to e.g. be completely ruthless with intimidation.)

## Claim 4

While it might be true that there is a sense in which human brains are “qualitatively optimal”, this still leaves a lot of room for quantitative advantage, similar to how among two universal computers, one can be vastly more efficient than the other for practical purposes. As a more relevant analogy, we can think of two learning algorithms that learn the same class of hypotheses while still having a significant difference in computational and/​or sample efficiency. In the limit of infinite resources and data, both algorithms converge to the same results, but in practice one still has a big advantage over the other. While undoubtedly there are hard limits to virtually every performance metric, there is no reason to believe evolution brought human brains anywhere near those limits. Furthermore, even if “scaling with resources” is the only thing that matters, the ability of AI to scale might be vastly better than the ability of humans to scale because of communication bandwidth bottlenecks between humans, not to mention the limited trust humans have towards one another (as opposed to large distributed AI systems, or disparate AI systems that can formally verify each other’s trustworthiness).

• Personally, I sometimes have the opposite metacognitive concern: that I’m not freaking out enough about AI risk. The argument goes: if I don’t have a strong emotional response, doesn’t it mean I’m lying to myself about believing that AI risk is real? I even did a few exercises in which I tried to visualize either the doom or some symbolic representation of the doom in order to see whether it triggers emotion or, conversely, exposes some self-deception, something that rings fake. The mental state that triggered was interesting, more like a feeling of calm meditative sadness than panic. Ultimately, I think you’re right when you say, if something doesn’t threaten me on the timescale of minutes, it shouldn’t send me into fight-or-flight. And, it doesn’t.

I also tentatively agree that it feels like there’s something unhealthy in the panicky response to Yudkowsky’s recent proclamation of doom, and it might lead to muddled thinking. For example, it seems like everyone around here is becoming convinced of shorter and shorter timelines, without sufficient evidence IMO. But, I don’t know whether your diagnosis is correct. Most of the discourse about AI risk around here is not producing any real progress on the problem. But, occasionally it does. And I’m not sure whether the root of the problem is psychological/memetic (as you claim) or just that it’s a difficult problem that only a few can meaningfully contribute to.

• 17 Nov 2022 6:53 UTC
3 points
0 ∶ 0
in reply to: Martín Soto’s comment

If the information takes a little longer to arrive, then the user will still be inside the threshold.

A more concerning problem is, what if the simulation only contains a coarse grained simulation of the user s.t. it doesn’t register as an agent. To account for this, we might need to define a notion of “coarse grained agent” and allow such entities to be candidate users. Or, maybe any coarse grained agent has to be an actual agent with a similar loss function, in which case everything works out on its own. These are nuances that probably require uncovering more of the math to understand properly.

• 15 Nov 2022 9:21 UTC
3 points
0 ∶ 0
in reply to: Martín Soto’s comment

Yes, but simulators might not just “alter reality so that they are slightly more causally tight than the user”, they might even “alter reality so that they are inside the threshold and the user no longer is”, right?

No. The simulation needs to imitate the null hypothesis (what we understand as reality), otherwise it’s falsified. Therefore, it has to be computing every part of the null universe visible to the AI. In particular, it has to compute the AI responding to the user responding to the AI. So, it’s not possible for the attacker to make the user-AI loop less tight.

...it would seem like no training procedure implementing PreDCA can be modified/​devised so as to achieve the guarantee of (almost surely) avoiding acausal attacks… because of the variety of attacks and the vastness of the space of hypotheses.

The variety of attacks doesn’t imply the impossibility of defending from them. In cryptography, we have protocols immune from all attacks[1] despite a vast space of possible attacks. Similarly, here I’m hoping to gradually transform the informal arguments above into a rigorous theorem (or well-supported conjecture) that the system is immune.

1. ↩︎

As long as the assumptions of the model hold, ofc. And, assuming some (highly likely) complexity-theoretic conjectures.

• 14 Nov 2022 7:51 UTC
4 points
0 ∶ 0