Without having read the transcript either, this sounds like it’s focused on near-term issues with autonomous weapons, and not meant to be a statement about the longer-term role autonomous weapons systems might play in increasing X-risk.
autonomous weapons are unlikely to directly contribute to existential risk
I disagree, and would’ve liked to see this argued for.
Perhaps the disagreement is at least somewhat about what we mean by “directly contribute”.
Autonomous weapons seem like one of the areas where competition is most likely to drive actors to sacrifice existential safety for performance. This is because the stakes are extremely high, quick response time seems very valuable (so keeping a human in the loop becomes costly), and international agreements around safety seem hard to imagine without massive geopolitical changes.
OK, so no “backwards causation”? (not sure if that’s a technical term and/or if I’m using it right...)
Is there a word we could use instead of “linear”, which to an ML person sounds like “as in linear algebra”?
What is “linear interactive causality”?
Sold out on the website. Any ideas where else to get one?
Otherwise I guess masks are a decent substitute (they don’t need to be P95 for this purpose...)
The TL;DR seems to be: “We only need a lower bound on the catastrophe/reasonable impact ratio, and an idea about how much utility is available for reasonable plans.”
This seems good… can you confirm my understanding below is correct?
1) RE: “A lower bound”: this seems good because we don’t need to know how extreme catastrophes could be; we can just say: “If (e.g.) the earth or the human species ceased to exist as we know it within the year, that would be catastrophic”.
2) RE: “How much utility is available”: I guess we can just set a targeted level of utility gain, and it won’t matter if there are plans we’d consider reasonable that would exceed that level? (e.g. “I’d be happy if we can make 50% more paperclips at the same cost in the next year.”)
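Putting my reading in symbols (all notation here is mine, not the OP’s): if any plan hitting the utility target has impact at most $I_{\text{reasonable}}$, and any catastrophe has impact at least $k \cdot I_{\text{reasonable}}$ for some $k \gg 1$, then it seems like it’s enough to accept only plans $\pi$ with
$$U(\pi) \ge U_{\text{target}} \quad\text{and}\quad \mathrm{Impact}(\pi) \le k' \cdot I_{\text{reasonable}}, \qquad 1 \le k' < k,$$
i.e. we never need to know how bad catastrophes can get, only that they’re at least a factor $k$ worse (in impact) than the reasonable plans we’d be happy with.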
I agree. While interesting, the contents and title of this post seem pretty mismatched.
I generally don’t read links when there’s no context provided, and think it’s almost always worth it (from a cooperative perspective) to provide a bit of context.
Can you give me a TL;DR of why this is relevant or what your point is in posting this link?
If you have access to the training data, then DNNs are basically theory simulatable, since you can just describe the training algorithm and the initialization scheme. The use of random initialization seems like an obstacle, but in practice we use pseudo-random numbers, so we can just include the algorithms for generating those in the description as well.
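(As a toy sketch of what I mean by “nothing is left unspecified”: given the data, the training procedure, and the PRNG seed, the whole training run is a deterministic function. A linear model trained by gradient descent stands in for a DNN here; the names are made up for illustration.)

```python
import numpy as np

def train_tiny_model(seed, X, y, steps=100, lr=0.1):
    # The "initialization scheme" is just a PRNG with a known seed...
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    # ...and the "training algorithm" is an explicit, deterministic procedure.
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the (halved) mean squared error
        w -= lr * grad
    return w

X = np.array([[0., 1.], [1., 0.], [1., 1.]])
y = np.array([1., 1., 2.])
w1 = train_tiny_model(seed=0, X=X, y=y)
w2 = train_tiny_model(seed=0, X=X, y=y)
assert np.array_equal(w1, w2)  # same seed, data, and algorithm => identical run
```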
I’m not sure it’s the same thing as alignment… it seems there’s at least 3 concepts here, and Hjalmar is talking about the 2nd, which is importantly different from the 1st:
1) “classic notion of alignment”: the AI has the correct goal (represented internally, e.g. as a reward function)
2) “CIRL notion of alignment”: the AI has a pointer to the correct goal (but the goal is represented externally, e.g. in a human partner’s mind)
3) “corrigibility”: something else
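To make the difference between the first two concrete, here’s a cartoon in code (my framing, not Hjalmar’s; all names are made up for illustration):

```python
class ClassicallyAlignedAgent:
    """Notion 1: the goal itself is represented inside the agent."""
    def __init__(self, reward_fn):
        self.reward_fn = reward_fn          # the correct goal, stored internally

    def evaluate(self, outcome):
        return self.reward_fn(outcome)


class CIRLStyleAgent:
    """Notion 2: the agent only holds a pointer -- a belief about a goal that
    lives externally, e.g. in a human partner's mind."""
    def __init__(self, belief_over_rewards):
        self.belief = belief_over_rewards   # {candidate_reward_fn: probability}

    def observe_human(self, update_rule, human_behavior):
        # refine the belief about the human's goal by observing the human
        self.belief = update_rule(self.belief, human_behavior)

    def evaluate(self, outcome):
        # expected reward under the current belief about the external goal
        return sum(p * r(outcome) for r, p in self.belief.items())
```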
What do you mean “these things”?
Also, to clarify, when you say “not going to be useful for alignment”, do you mean something like ”...for alignment of arbitrarily capable systems”? i.e. do you think they could be useful for aligning systems that aren’t too much smarter than humans?
So IIUC, you’re advocating trying to operate on beliefs rather than utility functions? But I don’t understand why.
We could instead verify that the model optimizes its objective while penalizing itself for becoming more able to optimize its objective.
As phrased, this sounds like it would require correctly (or at least conservatively) tuning the trade-off between these two goals, which might be difficult.
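To put the worry in symbols (my notation, not yours): if the effective objective becomes
$$J(\pi) \;=\; \mathbb{E}_\pi[R] \;-\; \lambda\,\Delta\mathrm{Capability}(\pi),$$
then $\lambda$ has to be large enough that the penalty dominates whenever the capability gain would be dangerous, yet small enough that the model still does useful work, and it’s not obvious how we’d verify that a given $\lambda$ sits in that window.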
One thing I found confusing about this post + Paul’s post “Worst-case guarantees” (2nd link in the OP: https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d) is that Paul says “This is the second guarantee from “Two guarantees,” and is basically corrigibility.” But you say: “Corrigibility seems to be one of the most promising candidates for such an acceptability condition”. So it seems like you guys might have somewhat different ideas about what corrigibility means.
Can you clarify what you think is the relationship?
Here’s a blog post arguing that conceptual analysis has been a complete failure, with a link to a paper saying the same thing: http://fakenous.net/?p=1130
Sure, but within AI, intelligence is the main feature we’re trying very hard to increase in our systems, and it’s the feature that would plausibly let the systems we build outcompete us. We aren’t trying to make AI systems that replicate as fast as possible. So it seems like the main thing to be worried about is intelligence.
Blaise Agüera y Arcas gave a keynote at this NeurIPS pushing ALife (motivated by specification problems, weirdly enough...: https://neurips.cc/Conferences/2019/Schedule?showEvent=15487).
The talk recording: https://slideslive.com/38921748/social-intelligence. I recommend it.
With 0, the AI never does anything and so is basically a rock
I’m trying to point at “myopic RL”, which does, in fact, do things.
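(A toy sketch of the distinction, with my own made-up numbers: a gamma = 0 agent ignores future reward, but it still ranks actions by immediate reward and acts on that ranking, so it’s not a rock.)

```python
import numpy as np

# One-step choice between two actions: action 0 pays off now, action 1 only later.
immediate_reward = np.array([1.0, 0.0])
future_value = np.array([0.0, 10.0])   # whatever the future holds...
gamma = 0.0                            # ...a myopic agent ignores it

q_values = immediate_reward + gamma * future_value
action = int(np.argmax(q_values))
print(action)  # 0 -- the myopic agent still does something, it just never plans ahead
```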
You might object that all of these can be made state-dependent, but you can make your example state-dependent by including the current time in the state.
I do object, and still object, since I don’t think we can realistically include the current time in the state. What we can include is: an impression of what the current time is, based on past and current observations. There’s an epistemic/indexical problem here you’re ignoring.
I’m not an expert on AIXI, but my impression from talking to AIXI researchers and looking at their papers is: finite-horizon variants of AIXI have this “problem” of time-inconsistent preferences, despite conditioning on the entire history (which basically provides an encoding of time). So I think the problem I’m referring to exists regardless.
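Here’s a toy version of the kind of time-inconsistency I mean (my own construction, not taken from any AIXI paper): an agent that always optimizes the next H steps ranks the same future choice differently at different times, simply because the horizon window moves with it.

```python
H = 2  # the agent always sums reward over the next H steps from "now"

# Two candidate actions available at step 1, with the rewards they yield at steps 1 and 2:
rewards = {"x": {1: 1, 2: 0},
           "y": {1: 0, 2: 10}}

def value(action, now):
    # only steps inside the window [now, now + H) count
    return sum(r for t, r in rewards[action].items() if now <= t < now + H)

# Planning at t=0: step 2 falls outside the horizon, so the agent intends to take x.
plan_at_0 = max(rewards, key=lambda a: value(a, now=0))
# Re-planning at t=1: step 2 is now inside the horizon, so the agent prefers y.
plan_at_1 = max(rewards, key=lambda a: value(a, now=1))

print(plan_at_0, plan_at_1)  # x y -- the same choice gets ranked differently over time
```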
Can’t I say that the emulation is me, and does morally matter (via FCToM), and also the many people enslaved in the system morally matter (via regular morality)?
What I’m saying is that, to use the language of the debate I referenced, “what kind of paper the equation is written on DOES matter”.
It seems like you’re saying that FCToM implies that if a physical system implements a morally relevant mathematical function, then the physical system itself cannot include morally relevant bits, and I don’t see why that has anything to do with FCToM.
I’m saying “naive FCToM”, as I’ve characterized it, says that. I doubt “naive FCToM” is even coherent. That’s sort of part of my broader point (which I haven’t made yet in this post).
I think I was maybe trying to convey too much of my high-level views here. What’s maybe more relevant and persuasive here is this line of thought:
1) Intelligence is very multi-faceted.
2) An AI that is super-intelligent in a large number (but small fraction) of the facets of intelligence could strategically outmaneuver humans.
3) Returning to the original point: such an AI could also be significantly less “rational” than humans.
Also, nitpicking a bit: to a large extent, society is trying to make systems that are as competitive as possible at narrow, profitable tasks. There are incentives for excellence in many domains. FWIW, I’m somewhat concerned about replicators in practice, e.g. because I think open-ended AI systems operating in the real world might create replicators accidentally/indifferently, and we might not notice fast enough.
My main opposition to this is that it’s not actionable
I think the main take-away from these concerns is to realize that there are extra risk factors that are hard to anticipate and for which we might not have good detection mechanisms. This should increase pessimism/paranoia, especially (IMO) regarding “benign” systems.
Idk, if it’s superintelligent, that system sounds both rational and competently goal-directed to me.
(non-hypothetical Q): What about if it has a horizon of 10^-8s? Or 0?
I’m leaning on “we’re confused about what rationality means” here, and specifically, I believe time-inconsistent preferences are something that many would say seem irrational (prima facie). But