# scasper (Stephen Casper)

I think my reply to this is essentially the same as my reply to this comment on the EA forum.

https://forum.effectivealtruism.org/posts/Bnp9YDqErNXHmTvvE/the-slippery-slope-from-dalle-2-to-deepfake-anarchy?commentId=NFAigaxwKzMRLNp9H

Did you read the post?

I think it is clear that we are not advocating for centralized authority. All three points in the “takeaways” section lead to this. The questions you asked in the second paragraph are ones that we discuss in the “what do we want” section, with some additional material in the EA Forum version of the post.

Without falling into the trap of debating definitions: “anarchy” can be used colloquially to refer to chaos in general, and it was intended here to mean a lack of barriers to misuse in the regulatory and dev ecosystems, not a lack of someone’s monopoly on something. If you are in favor of people not monopolizing capabilities, I’m sure you would agree with our third “takeaways” point.

The “what do we want” section is about solutions that don’t involve banning anything. We don’t advocate for banning anything. The “banning” comments are strawpersoning.

The Vice article came out on August 24th. That was 5 days after the SD leak and 2 days after its official open-source release. Its claim that SD couldn’t “begin to trick anyone into thinking they’re real snapshots of nudes” did not stand the test of time. We linked the Vice article in the context of the discussion of deepfake porn in general, not of the specific photorealistic capabilities of SD.

Speaking of which, DreamBooth does allow for this. See this SFW example; this is the type of thing that would not be possible with older methods. https://www.reddit.com/r/StableDiffusion/comments/y1xgx0/dreambooth_completely_blows_my_mind_first_attempt/

And bear in mind that new updates, GUIs, APIs, capabilities, etc are arriving almost daily.

I will not link NSFW examples. But I have seen them. They are just as realistic. Others have agreed. I’ve gotten several people banned from social media platforms after reporting them.

# The Slippery Slope from DALLE-2 to Deepfake Anarchy

I think I agree with johnswentworth’s comment. I think there is a chance that equating genuinely useful ASI safety-related work with deceptive alignment could be harmful. To give another perspective, I would also add that I think your definition of deceptive alignment is very broad, broad enough to encompass areas of research that I find quite distinct (e.g. better training methods vs. better remediation methods), yet it still seems to exclude some other things that I think matter a lot for AGI safety. Some quick examples I thought of in a few minutes are:

Algorithms that fail less in general and are more likely to pass the safety-relevant tests we DO throw at them. This seems like a very broad category.

Containment

Resolving foundational questions about embedded agents, rationality, paradoxes, etc.

Forecasting

Governance, policy, law

Working on near-term problems like ones involving fairness/bias, recommender systems, and self-driving cars. These problems are often thought of as separate from AGI safety, but it seems highly likely that working on them now would teach us useful lessons for later.

The taxonomy we introduced in the survey gave us a helpful way of splitting up the problem. Other than that, it took a lot of effort, several Google Docs that got very messy, and https://www.connectedpapers.com/. Personally, I’ve also been working on interpretability for a while and have passively formed a mental model of the space.

My answer to this is actually tucked into one paragraph on the 10th page of the paper: “This type of approach is valuable...reverse engineering a system”. We cite examples of papers that have used interpretability tools to generate novel adversaries, aid in manually fine-tuning a network to induce a predictable change, or reverse engineer a network. Here they are.

Making adversaries:

https://distill.pub/2019/activation-atlas/

https://arxiv.org/abs/2110.03605

https://arxiv.org/abs/1811.12231

https://arxiv.org/abs/2201.11114

https://arxiv.org/abs/2206.14754

https://arxiv.org/abs/2106.03805

https://arxiv.org/abs/2006.14032

https://arxiv.org/abs/2208.08831

https://arxiv.org/abs/2205.01663

Manual fine-tuning:

https://arxiv.org/abs/2202.05262

https://arxiv.org/abs/2105.04857

Reverse engineering (I’d put an asterisk on these ones though because I don’t expect methods like this to scale well to non-toy problems):

https://distill.pub/2020/circuits/curve-detectors/
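As a toy illustration of the adversary-generation idea (my own sketch, not a method from any of the linked papers): if interpretability work has associated a particular hidden unit with some feature, one can perturb an input to excite that unit while keeping the change small, yielding an input that triggers the feature "for the wrong reasons". A minimal numpy sketch, with a random layer standing in for the real model being probed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for a trained model's first layer (assumption: in practice
# W and b would come from the actual network under study).
W = rng.normal(size=(32, 16))
b = rng.normal(size=32)

def hidden(x):
    return np.maximum(W @ x + b, 0.0)  # ReLU feature layer

def feature_guided_perturbation(x, neuron, steps=50, lr=0.05, eps=0.5):
    """Nudge the input to excite one hidden unit, staying within an
    L-infinity ball of radius eps around the original input."""
    x0 = x.copy()
    x = x.copy()
    for _ in range(steps):
        # For a linear layer, the gradient of the unit's pre-activation
        # w.r.t. the input is just that unit's weight row.
        x = np.clip(x + lr * np.sign(W[neuron]), x0 - eps, x0 + eps)
    return x

x = rng.normal(size=16)
adv = feature_guided_perturbation(x, neuron=3)
```

The perturbed input stays within the eps-ball of the original but activates the targeted unit at least as strongly; in a deep network one would use autodiff for the gradient rather than the closed form above.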

# [Linkpost] A survey on over 300 works about interpretability in deep networks

Don’t know what part of the post you’re referring to.

In both cases, it can block the Löbian proofs. But something is unsatisfying about making ad hoc adjustments to one’s policy like this. Instead of trying to write my own explanation, I’ll quote Demski, who writes:

> Secondly, an agent could reason logically but with some looseness. This can fortuitously block the Troll Bridge proof. However, the approach seems worryingly unprincipled, because we can “improve” the epistemics by tightening the relationship to logic, and get a decision-theoretically much worse result.
>
> The problem here is that we have some epistemic principles which suggest tightening up is good (it’s free money; the looser relationship doesn’t lose much, but it’s a dead-weight loss), and no epistemic principles pointing the other way. So it feels like an unprincipled exception: “being less dutch-bookable is generally better, but hang loose in this one case, would you?”
>
> Naturally, this approach is still very interesting, and could be pursued further—especially if we could give a more principled reason to keep the observance of logic loose in this particular case. But this isn’t the direction this document will propose. (Although you *could* think of the proposals here as giving more principled reasons to let the relationship with logic be loose, sort of.) So here, we will be interested in solutions which “solve troll bridge” in the stronger sense of getting it right while fully respecting logic. IE, updating to probability 1 (/0) when something is proven (/refuted).

Any chance you could clarify?

In the troll bridge problem, the counterfactual (the agent crossing the bridge) would indicate the inconsistency of the agent’s logical system of reasoning. See this post and what Demski calls a subjective theory of counterfactuals.

> in your terms an “object” view and an “agent” view.

Yes, I think that there is a time and place for each of these two stances toward agents. The object stance is for when we are thinking about how behavior is deterministic conditioned on the state of the world and the agent; the agent stance is for when we are trying to be purposive and think about what types of agents to be or design. If we never took the object stance, we couldn’t successfully understand many dilemmas, and if we never took the agent stance, there would seem to be little point in talking about what any agent ever “should” do.

> There’s a sense in which this is self-defeating b/c if CDT implies that you should pre-commit to FDT, then why do you care what CDT recommends as it appears to have undermined itself?

I don’t especially care.

> counterfactuals only make sense from within themselves

Is naive thinking about the troll bridge problem a counterexample to this? There, the counterfactual stems from a contradiction.

> CDT doesn’t recommend itself, but FDT does, so this process leads us to replace our initial starting assumption of CDT with FDT.

I think that no decision theory worth two cents always recommends itself. Any decision theory X that isn’t silly would recommend replacing itself before entering a mind-policing environment in which the mind police punish an agent iff they use X.

Thanks. I rewrote the second bit you quoted; I agree that sketching the proof that way was not good.

Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross.

This should be clearer and should not imply that Rob needs to be able to prove his own consistency. I hope that helps.

Here’s the new version of the paragraph with my mistaken explanation fixed.

“Suppose that hypothetically, Rob proves that crossing the bridge would lead to it blowing up. Then if he crossed, he would be inconsistent. And if so, the troll would blow up the bridge. So Rob can prove that a proof that crossing would result in the bridge blowing up would mean that crossing would result in the bridge blowing up. So Rob would conclude that he should not cross.”
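For readers who want the shape of that argument in symbols (notation I'm introducing here, not from the post): write □ for Rob's provability predicate, C for "Rob crosses", and B for "the bridge blows up". The paragraph above then compresses to one application of Löb's theorem:

```latex
% Rob's hypothetical reasoning establishes the Löb premise for P := (C \to B)
\begin{align*}
&\vdash \Box(C \to B) \to (C \to B)
  && \text{(crossing despite such a proof would make Rob inconsistent,}\\
&&& \text{so the troll would blow up the bridge)}\\
&\vdash C \to B
  && \text{(L\"ob's theorem, with $P = C \to B$)}\\
&\vdash \lnot C
  && \text{(Rob obeys his own proof and does not cross)}
\end{align*}
```

Note that nothing here requires Rob to prove his own consistency; Löb's theorem only needs the first line to be provable.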

Thanks for the comment. tl;dr, I think you mixed up some things I said, and interpreted others in a different way than I intended. But either way, I don’t think there are “enormous problems”.

> So the statement to be proven (which I shall call P) is not just “agent takes action X”, but “when presented with this specific proof of P, the agent takes action X”.

Remember that I intentionally give a simplified sketch of the proof instead of providing it in full. If I did, I would specify the provability predicate. I think you’re conflating what I say about the proof with what I say about the agent. Here, I say that our model agent who is vulnerable to spurious proofs would obey a proof, if presented with one, that it would take X. Demski explains things the same way. I don’t say that this is the definition of the provability predicate. In this case, an agent being willing to accede to proofs in general that it will take X is indeed sufficient for being vulnerable to spurious proofs.

> Second is that in order for Löb’s theorem to have any useful meaning in this context, the agent must be consistent and *able to prove its own consistency*, which it cannot do by Gödel’s second incompleteness theorem.

I don’t know where you’re getting this from. It would be helpful if you mentioned where. I definitely don’t say anywhere that Rob must prove his own consistency, and neither of the two types of proofs I sketch out assumes this either. You might be focusing on a bit that I wrote: “So assuming the consistency of his logical system...” I’ll edit this explanation for clarity. I don’t intend that Rob be able to prove his own consistency, but rather that if he proved that crossing would make the bridge blow up, that would imply that crossing would make it blow up.

> As presented, it is given a statement P (which could be anything), and asked to verify that “Prov(P) → P” for use in Löb’s theorem. While the post claims that this is obvious, it is absolutely *not*.

I don’t know where you’re getting this from either. In the “This is not trivial...” paragraph I explicitly talk about the difference between statements, proofs, and provability predicates. I think you have some confusion about what I’m saying, either due to skimming or to how I have the word “hypothetically” do a lot of work in my explanation of this (arguably too much). But I definitely do not claim that “Prov(P) → P”.

No.

The hypothetical pudding matters too!

My best answer here is in the form of this paper I wrote, which discusses these dilemmas and a number of others. Decision-theoretic flaws like the ones here are examples of subtle flaws in seemingly reasonable frameworks for making decisions that may lead to unexpected failures in niche situations. For agents who are vulnerable to either spurious proofs or trolls, there are adversarial situations that could effectively exploit these weaknesses. These issues aren’t tied to incompleteness so much as they are simply examples of ways that agents could be manipulable.

The importance of this question doesn’t involve whether or not there is an “option” in the first case or what you can or can’t do in the second. What matters is whether, **hypothetically**, you would always obey such a proof or would potentially disobey one. The hypothetical here matters independently of what actually happens, because the hypothetical commitments of an agent can potentially be used in cases like this to prove things about it via Löb’s theorem.

Another case in which how an agent would hypothetically behave can have a real influence on its actual circumstances is Newcomb’s problem. See this post.

I think this seems really cool. I’m excited about this. The kind of thing that I would hope to see next is a demonstration that this method can be useful for modifying the transformer in a way that induces a predictable change in the network’s behavior. For example, if you identify a certain type of behavior like toxicity or discussion of certain topics, can you use these interpretations to guide updates to the weights of the model that cause it to no longer say these types of things according to a classifier for them?
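To make the shape of that "interpretation-guided predictable change" concrete (this is my own toy illustration, not a method from the post): suppose interpretation work hands you a unit direction v in feature space associated with the unwanted behavior. One crude but predictable weight edit is to project the output weights orthogonal to v, so the edited layer ignores that component of the features entirely. A numpy sketch with random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: W_out maps hidden features to logits, and v is a unit
# direction in feature space that interpretation work has associated with
# the unwanted behavior (both are random stand-ins here).
W_out = rng.normal(size=(8, 32))
v = rng.normal(size=32)
v /= np.linalg.norm(v)

def ablate_direction(W, v):
    """Edit the weights so they ignore the component of the features
    lying along v (project each row of W orthogonal to v)."""
    return W - np.outer(W @ v, v)

W_edited = ablate_direction(W_out, v)

# The edited layer's output no longer depends on the v-component of any
# feature vector h: W_edited @ h == W_out @ (h - (h . v) v).
h = rng.normal(size=32)
```

The change is "predictable" in the sense that its effect on any input is fully characterized by the projection, which a downstream behavior classifier could then be used to evaluate.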