idk, sounds dangerously close to deferences
honestly i prefer undonfrences
Yeah I think there’s a miscommunication. We could try having a phone call.
A guess at the situation is that I’m responding to two separate things. One is the story here:
One mainstay of claiming alignment is near-impossible is the difficulty of “solving ethics”—identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect—this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
It does simplify the problem, but not massively relative to the whole problem. A harder part shows up in the task of having a thing that
is capable enough to do things that would help humans a lot, like a lot a lot, whether or not it actually does those things, and
doesn’t ~~kill everyone~~ destroy approximately all human value.
And I’m not pulling a trick on you where I say that X is the hard part, and then you realize that actually we don’t have to do X, and then I say “Oh wait actually Y is the hard part”. Here is a quote from “Coherent Extrapolated Volition”, Yudkowsky 2004 https://intelligence.org/files/CEV.pdf:
1. Solving the technical problems required to maintain a well-specified abstract invariant in a self-modifying goal system. (Interestingly, this problem is relatively straightforward from a theoretical standpoint.)
2. Choosing something nice to do with the AI. This is about midway in theoretical hairiness between problems 1 and 3.
3. Designing a framework for an abstract invariant that doesn’t automatically wipe out the human species. This is the hard part.
I realize now that I don’t know whether or not you view IF as trying to address this problem.
The other thing I’m responding to is:
the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
If the AGI can (relevantly) act as a collaborator in improving its alignment, it’s already a creative intelligence on par with humanity. Which means there was already something that made a creative intelligence on par with humanity. Which is probably fast, ongoing, and nearly inextricable from the mere operation of the AGI.
I also now realize that I don’t know how much of a crux for you the claim that you made is.
I personally have updated a fair amount over time on:
people continuing to express invalid reasoning for their beliefs about timelines and alignment;
people continuing to express beliefs about timelines and alignment that seem relatively more explicable via explanations other than “they have some good reason to believe this that I don’t know about”;
other people’s alignment hopes and mental strategies having more visible flaws and visible doomednesses;
other people mostly not seeming to cumulatively integrate the doomednesses of their approaches into their mental landscape as guiding elements;
my own attempts to do so failing in a different way, namely that I’m too dumb to move effectively in the resulting modified landscape.
We can back out predictions of my personal models from this, such as “we will continue to not have a clear theory of alignment” or “there will continue to be consensus views that aren’t supported by reasoning that’s solid enough that it ought to produce that consensus if everyone is being reasonable”.
That’s another main possibility. I don’t buy the reasoning in general though—integrity is just super valuable. (Separately I’m aware of projects that are very important and neglected (legibly so) without being funded, so I don’t overall believe that there are a bunch of people strategically capitulating to anti-integrity systems in order to fund key projects.) Anyway, my main interest here is to say that there is a real, large-scale, ongoing problem(s) with the social world, which increases X-risk; it would be good for some people to think clearly about that; and it’s not good to be satisfied with false / vague / superficial stories about what’s happening.
I’m interpreting “realize” colloquially, as in, “be aware of”. I don’t think the people discussed in the post just haven’t had it occur to them that pre-singularity wealth doesn’t matter because a win singularity society very likely wouldn’t care much about it. Instead someone might, for example...
...care a lot about their and their people’s lives in the next few decades.
...view it as being the case that [wealth mattering] is dependent on human coordination, and not trust others to coordinate like that. (In other words: the “stakeholders” would have to all agree to cede de facto power from themselves, to humanity.)
...not agree that humanity will or should treat wealth as not mattering; and instead intend to pursue a wealthy and powerful position mid-singularity, with the expectation of this strategy having large payoffs.
...be in some sort of mindbroken state (in the genre of Moral Mazes), such that they aren’t really (say, in higher-order derivatives) modeling the connection between actions and long-term outcomes, and instead are, I don’t know, doing something else, maybe involving arbitrary obeisance to power.
I don’t know what’s up with people, but I think it’s potentially important to understand deeply what’s up with people, without making whatever assumption goes into thinking that IF someone only became aware of this vision of the future, THEN they would adopt it.
(If Tammy responded that “realize” was supposed to mean the etymonic sense of “making real” then I’d have to concede.)
the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
Why do you think you can get to a state where the AGI is materially helping to solve extremely difficult problems (not extremely difficult like chess, extremely difficult like inventing language before you have language), and also the AGI got there due to some process that doesn’t also immediately cause there to be a much smarter AGI? https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html
IDK if there’s political support that would be helpful and that could be affected by people saying things to their representatives. But if so, then it would be helpful to have a short, clear, on-point letter that people can adapt to send to their representatives. Things I’d want to see in such a letter:
AGI, if created, would destroy all or nearly all human value.
We aren’t remotely on track to solving the technical problems that would need to be solved in order to build AGI without destroying all or nearly all human value.
Many researchers say they are trying to build AGI and/or doing research that materially contributes toward building AGI. None of those researchers has a plausible plan for making AGI that doesn’t destroy all or nearly all human value.
As your constituent, I don’t want all or nearly all human value to be destroyed.
Please start learning about this so that you can lend your political weight to proposals that would address existential risk from AGI.
This is more important to me than all other risks about AI combined.
Or something.
I wish you would realize that whatever we’re looking at, it isn’t people not realizing this.
Look… Consider the hypothetically possible situation that in fact everyone is very far from being on the right track, and everything everyone is doing doesn’t help with the right track and isn’t on track to get on the right track or to help with the right track.
Ok, so I’m telling you that this hypothetically possible situation seems to me like the reality. And then you’re, I don’t know, trying to retreat to some sort of agreeable live-and-let-live stance, or something, where we all just agree that due to model uncertainty and the fact that people have vaguely plausible stories for how their thing might possibly be helpful, everyone should do their own thing and it’s not helpful to try to say that some big swath of research is doomed? If this is what’s happening, then I think that what you in particular are doing here is a bad thing to do.
Maybe we can have a phone call if you’d like to discuss further.
Doomed to irrelevance, or doomed to not being a complete solution in and of itself?
Doomed to not be trying to go to and then climb the mountain.
my brain is a dirty lying liar that lies to me at every opportunity
So then it isn’t easy. But it’s feedback. Also there’s not that much distinction between making a philosophically rigorous argument and “doing introspection” in the sense I mean, so if you think the former is feasible, work from there.
Is there a particular reason you expect there to be exactly one hard part of the problem,
Have you stopped beating your wife? I say “the” here in the sense of like “the problem of climbing that mountain over there”. If you’re far away, it makes sense to talk about “the (thing over there)”, even if, when you’re up close, there’s multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.
and for the part that ends up being hardest in the end to be the part that looks hardest to us now?
We make an argument like “any solution would have to address X” or “anything with feature Y does not do Z” or “property W is impossible”, and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance. It’s not like pointing to a little ball in ideaspace and being like “the answer is somewhere in here”. Rather it’s like cutting out a halfspace and saying “everything on this side of this plane is doomed, we’d have to be somewhere in the other half”, or like pointing out a manifold that all research is on and saying “anything on this manifold is doomed, we’d have to figure out how to move somewhat orthogonalward”.
research that stemmed from someone trying something extremely simple and getting an unexpected result
I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant. (I don’t care if you have an army of people who all agree on taking a stance that seems to imply that there’s not much relevant difference between LLMs and future AGI systems that might kill everyone.)
What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?
I think you (and everyone else) don’t know how to ask this question properly. For example, “on whether your theory describes the world as it is” is a too-narrow idea of what our thoughts about minds are supposed to be. Sub-example: our thoughts about mind are supposed to also produce design ideas.
To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans, and the best access you have to minds is introspection. (I don’t mean meditation, I mean thinking and also thinking about thinking/wanting/acting—aka some kinds of philosophy and math.)
I broadly agree with this. (And David was like .7 out of the 1.5 profs on the list who I guessed might genuinely want to grant the needed freedom.)
I do think that people might do good related work in math (specifically probability/information theory, logic, and other work on formalized reasoning), philosophy (of mind), and possibly in other places such as theoretical linguistics. But this would require that the academic context be conducive to good novel work in the field, a lower bar which is probably far from universally met; and it would require the researcher to have good taste. And this is “related” in the sense of “might write a paper which leads to another paper which would be cited by [the alignment textbook from the future] for proofs/analogies/evidence about minds”.
I don’t speak for MIRI, but broadly I think MIRI thinks that roughly no existing research is hopeworthy, and that this isn’t likely to change soon. I think that, anyway.
In discussions like this one, I’m conditioning on something like “it’s worth it, these days, to directly try to solve AGI alignment”. That seems assumed in the post, seems assumed in lots of these discussions, seems assumed by lots of funders, and it’s why above I wrote “the main direct help we can give to AGI alignment” rather than something stronger like “the main help (simpliciter) we can give to AGI alignment” or “the main way we can decrease X-risk”.
If, hypothetically, we were doing MI on minds, then I would predict that MI would pick some low-hanging fruit and then hit walls where its methods stop working, and it would become more difficult to develop new methods that work. The new methods that work would look more and more like reflecting on one’s own thinking, discovering new ways of understanding one’s own thinking, and then going and looking for something like that in the in-vitro mind. IDK how far that could go. But this would completely grind to a halt when the IVM is coming up with concepts and ways of thinking that are novel to humanity. Some other approach would be needed to learn new ideas from a mind via MI.
However, another dealbreaker problem with current and current-trajectory MI is that it isn’t studying minds.
From the section you linked:
Moreover, the program guarantees at least some mentorship from your supervisor. Your advisor’s incentives are reasonably aligned with yours: they get judged by your success in general, so want to see you publish well-recognized first-author research, land a top research job after graduation and generally make a name for yourself (and by extension, them).
Doing a PhD also pushes you to learn how to communicate with the broader ML research community. The “publish or perish” imperative means you’ll get good at writing conference papers and defending your work.
These would be exactly the “anyone around them” about whose opinion they would have to not give a fuck.
I don’t know a good way to do this, but maybe a pointer would be: funders should explicitly state something to the effect of:
“The purpose of this PhD funding is to find new approaches to core problems in AGI alignment. Success in this goal can’t be judged by an existing academic structure (journals, conferences, peer-review, professors) because there does not exist such a structure aimed at the core problems in AGI alignment. You may if you wish make it a major goal of yours to produce output that is well-received by some group in academia, but be aware that this goal would be non-overlapping with the purpose of this PhD funding.”
The Vitalik fellowship says:
To be eligible, applicants should either be graduate students or be applying to PhD programs. Funding is conditional on being accepted to a PhD program, working on AI existential safety research, and having an advisor who can confirm to us that they will support the student’s work on AI existential safety research.
Despite being an extremely reasonable (even necessary) requirement, this is already a major problem according to me. The problem is that (IIUC—not sure) academics are incentivized to, basically, be dishonest, if it gets them funding for projects / students. Of the ~dozen professors here (https://futureoflife.org/about-us/our-people/ai-existential-safety-community/) who I’m at least a tiny bit familiar with, I think maybe 1.5ish are actually going to happily support actually-exploratory PhD students. I could be wrong about this though—curious for more data either way. And how many will successfully communicate, to the sort of person who would take a real shot at exploratory conceptual research if given the opportunity, that they would in fact support that kind of research? I don’t know. Zero? One? And how would someone sent to the FLI page know of the existence of that professor?
Fellows are expected to participate in annual workshops and other activities that will be organized to help them interact and network with other researchers in the field.
Continued funding is contingent on continued eligibility, demonstrated by submitting a brief (~1 page) progress report by July 1st of each year.
Again, reasonable, but… Needs more clarity on what is expected, and what is not expected.
a technical specification of the proposed research
What does this even mean? This webpage doesn’t get it. We’re trying to buy something that isn’t something someone can already write a technical specification of.
That would only work for people with the capacity to not give a fuck what anyone around them thinks, especially including the person funding and advising them. And that’s arguably unethical depending on context.
the doppelganger problem is a fairly standard criticism of the sparse autoencoder work,
And what’s the response to the criticism, or a/the hoped approach?
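(For readers who haven’t seen the object being criticized: here is a minimal sketch of the general shape of sparse autoencoder work, in toy PyTorch. The dimensions, hyperparameters, and names are made up for illustration; this is not anyone’s actual setup, just the shape of the technique.)

```python
# Toy sketch of a sparse autoencoder (SAE) over model activations.
# All dimensions/coefficients are illustrative, not anyone's real configuration.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations):
        # Encode activations into a wide, sparse latent layer, then reconstruct.
        latents = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(latents)
        return reconstruction, latents

def loss_fn(activations, reconstruction, latents, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty pushing latents toward sparsity.
    # A low loss says the latents compress the activations well; it does not by
    # itself say they correspond to variables the model actually uses, which is
    # (as I understand it) the sort of gap the "doppelganger" criticism gestures at.
    recon_loss = (activations - reconstruction).pow(2).mean()
    sparsity_loss = latents.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Toy usage on random "activations":
sae = SparseAutoencoder()
acts = torch.randn(8, 512)
recon, latents = sae(acts)
print(loss_fn(acts, recon, latents))
```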
diasystemic novelty seems the kind of thing you’d encounter when doing developmental interpretability, interp-through-time
Yeah, this makes sense. And hey, maybe it will lead to good stuff. Any results so far that I might consider to be approaching some core alignment difficulties?
it seems the kind of thing which would come from the study of in-context learning, a goal that mainstream MI I believe has, even if it doesn’t focus on it now (likely because it believes it’s unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.
Also makes some sense (though the ex quo, insofar as we even want to attribute this to current systems, is distributed across the training algorithms and the architecture sources, as well as inference-time stuff).
Generally what you’re bringing up sounds like “yes these are problems and MI would like to think about them… later”. Which is understandable, but yeah, that’s what streetlighting looks like.
Maybe an implicit justification of current work is like:
There’s these more important, more difficult problems. We want to deal with them, but they are too hard right now, so we will try in the future. Right now we’ll deal with simpler things. By dealing with simpler things, we’ll build up knowledge, skills, tools, and surrounding/supporting orientation (e.g. explaining weird phenomena that are actually due to already-understandable stuff, so that later we don’t get distracted). This will make it easier to deal with the hard stuff in the future.
This makes a lot of sense—it’s both empathizandable, and seems probably somewhat true. However:
Again, it still isn’t in fact currently addressing the hard parts. We want to keep straight the difference between [currently addressing] vs. [arguably might address in the future].
We gotta think about what sort of thing would possibly ever work. We gotta think about this now, as much as possible.
A core motivating intuition behind the MI program is (I think) “the stuff is all there, perfectly accessible programmatically, we just have to learn to read it”. This intuition is deeply flawed; see “Koan: divining alien datastructures from RAM activations”.
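(A toy illustration of the point, not from the koan itself; the specific program state here is made up. Programmatic access to the raw bytes is trivial; knowing what they mean is the whole problem.)

```python
# The bytes of a program's internal state are perfectly accessible, but they
# don't announce their own schema, let alone what the recovered values are about.
import struct

# Some internal "state" of a toy program: three numbers whose meaning
# (a position? a probability? a goal weight?) lives only in the program's code.
state = (0.73, -1.2, 42.0)
raw = struct.pack("fff", state[0], state[1], state[2])

print(list(raw))                 # 12 byte values, fully accessible programmatically
print(struct.unpack("fff", raw)) # perfectly recoverable, IF you already know the format

# For an alien datastructure there is no documented format to look up, and no
# guarantee the relevant "variables" are packed floats at all. "Learning to read
# it" is not a detail left to the reader; it is the hard part.
```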