I have signed no contracts or agreements whose existence I cannot mention.
plex
Separating Prediction from Goal-Seeking
Two Skillsets You Need to Launch an Impactful AI Safety Project
Pre-registering an important prediction: fa9be9574af1db6423d660a768f62ee02a97e9760ee7dbace6dc8643c2201d9d
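For anyone unfamiliar with this kind of commitment, here is a minimal sketch of how such a hash could be produced and later verified, assuming SHA-256 over the prediction text plus a random nonce (the actual scheme behind the hash above isn't stated, so treat this purely as illustration):

```python
import hashlib
import secrets

# Hypothetical prediction text and nonce; the real ones stay private until reveal.
prediction = "EXAMPLE: model X will reach benchmark Y by date Z"
nonce = secrets.token_hex(16)

# Publish only the hash now.
commitment = hashlib.sha256(f"{prediction}|{nonce}".encode("utf-8")).hexdigest()
print(commitment)

# At reveal time, publish prediction + nonce; anyone can recompute and compare.
assert hashlib.sha256(f"{prediction}|{nonce}".encode("utf-8")).hexdigest() == commitment
```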
I think I’m pretty happy with my terms, using Agent in the DeepMind Discovering Agents sense and Subagent in the Multiagent Models of Mind sense. These feel like crisp underlying abstractions which come in various forms, rather than different kinds of things conflated together. For Shard, yep, I like that term and think it captures something also fairly crisp.
I like to see these as consequences of different control/information structures. I kind of agree with the stuff on power seeking, yet I also want to point out that if you’re in a company (a top-down organisational structure), you can ask whether an individual contributor is less useful than a manager. The IC might be less load-bearing on the overall direction at times, yet that person can often say a lot about some very specific system that matters.
Yes, subprocesses absolutely impact the direction of the overall system; in fact, they are usually spun up to do things the superagent could not do as easily.
Sometimes you can get more agency in another direction by having options removed from you, so it matters what type of agency is being removed, and sometimes it can be a good thing to be a fully managed agent! (E.g. someone forces you to eat healthily so that you get more energy on average.)
I think versions of this where your agency is actually in line with a restriction can be good, especially if you place restrictions on yourself (self-management), but if you’re constantly chafing against e.g. eating healthily, it will generally cause more problems than it’s worth.
I also think that the truer-name version of this is something like a scalar property of a message-passing relationship between two agents.
Yup, agreed; it’s not a binary, just one useful angle for looking at relationships between agents.
that it is not only top-down control structures that matter; there are other forms of organisation such as markets, democracies, networks, and communities as well.
Yes! These cases I would classify as being managed or selected for trust by a superagent.
There is for sure a throne of current goal orientation; I think it’s the GNW / current working memory. But I’m pretty sure there are a bunch of subagents making bids to occupy that throne, with a huge amount of subconscious parallel processing. Much of what I’ve learned about and in therapy closely matches the excellently written Multiagent Models of Mind sequence, which is also super good as an intro to therapy and psychological healing.
No.
Okay, that is a position there might be good arguments for, but then it seems important to say loudly and clearly, both inside GDM and outside, that you do not have a plan or roadmap for superintelligence misalignment (even if you don’t think you should have one). If nothing else, this is the kind of thing your leadership should be made aware of explicitly, so they can either adjust it or use it in their own public communications to try and reduce race dynamics.
It does match my general experience with moderate tactical projects (say, projects involving up to about 10 person-years of research effort), but not with large, complex, important projects.
Okay, would you like to bet on whether some of the largest research programs had plans going into them? I haven’t checked, but I would put at least 10:1 odds that if we pick, say, three projects like the Apollo Program, the Manhattan Project, and others of a similar scale and type, they will all have had a high-level roadmap of things to try which could plausibly address the core challenges quite early on[1], even if a lot of details ended up changing when they ran into reality.
There is also a 100+ page paper that I linked in the original post, which goes into a fair amount of detail on what the various risks and mitigations might look like.
When I ask a plain AI (no special prompting, history off) to summarize, it says:
(detailed analysis of non-superintelligence-focused bits) Is there a different document which does focus on either different approaches aimed at superintelligence, or analysis of whether these approaches are actually fit for that challenge? Or is this summary incorrect? If so, it would be much easier for you, as an author of the paper, to point out and quote the relevant sections than for me, since I would have to read it from scratch and currently do not expect to find things in those 100 pages which explicitly address the most difficult bottleneck.
(I am genuinely glad you’re engaging, but I am not reassured so far, and encourage you to look at the stack of how you’re evaluating this specific concern I’m raising and see if you’re running a truth-seeking process which would, if I had a fair point, be able to notice it.)
[1] Let’s say a collection of core technical problems to be solved, and a set of plausible solutions to try (perhaps all of which were discarded, but were a starting point for exploration).
Hmm, I usually expect that large, complex, important projects should have a roadmap: some sketch of the future that goes well, with details to fill in. The more detailed it is, the more we can check it for consistency and likelihood of working. Does this match your general experience with planning projects that try to achieve a goal?
What you say there looks like an extremely vague, high-level roadmap that sounds to me like ‘we’ll figure out our plan as we go, as data comes in’, plus automated alignment.
I would be really enthusiastic for you and your team to try unblurring that roadmap, and seeing what difficulties you find at superintelligence level on the current path.
Not necessary; there will be a wide array of learning objectives, and some will have maths or AI prerequisites, but not all. Being a strong truth-seeker and fast learner is far more important than domain knowledge.
If I were to make statements like that (which I haven’t exactly), I would be referring to superintelligence misalignment risks specifically, as that seems like by far the tightest bottleneck on surviving futures. The linked paper says:
To address misalignment, we outline two lines of defense. First, model-level mitigations such as amplified oversight and robust training can help to build an aligned model. Second, system-level security measures such as monitoring and access control can mitigate harm even if the model is misaligned.
These don’t seem like the class of approach that could be sufficient for handling superintelligence-level optimization, for reasons I’m sure you’re tracking, given that you later say:
“This means that our approach covers conversational systems, agentic systems, reasoning, learned novel concepts, and some aspects of recursive improvement, while setting aside goal drift and novel risks from superintelligence as future work.”
Do you have a plan for superintelligence misalignment risks?
Managed vs Unmanaged Agency
The source for footnote 4 is a shortform by me, which was specifically about Ayahuasca, which is much more likely than other psychedelics to have the described effects, though many go in that direction. Edit: oops, overconfident! Many people can figure out the same thing; models of reality are convergent because there is a reality to study.
Cool, I went with the most modern Glass, the Enterprise 2, for the higher RAM and other hardware specs, figuring that the software would work itself out these days.
Okay, I’m in. Bought one. Any ideas or code or tips that you think it’s worth sharing, here or by DM?
I’d guess the underlying generators of the text (abstractions, circuits, etc.) are entangled semantically in ways that mean surface-level filtering is not going to remove the transferred structure. Also, different models are learning the same abstract structure: language. So these entanglements would be expected to transfer over fairly well.
Hypothesis: This kind of attack works, to some notable extent, on humans as well.
This is going to be a fun rest of the timeline.
“I’m less worried about climate change than climate removal.”
If you write a condensed and better-named version of this, Lens Academy will use it in the flagship course. p(>0.95)
It doesn’t persist on FF. The switch to Feed does, but not the settings.
real AI safety is a nonexistent field.
Trying to fix this. If you wanna help us promote it (once it’s ready; we need to prep docs) or visit, that would be cool.
(Also, superintelligence alignment is not entirely nonexistent as a field; I think there are at least a few dozen people directly trying to tackle the main bottleneck, plus a bunch more banking on automated alignment, which is super doomy if you don’t understand alignment well enough to direct your research system, but is still attempting a thing.)
If my hypothesis is correct: the poison is the type of circuit implied by the data, and mech interp good enough to pick out that circuit in a model trained on the dataset is needed to identify the poison, because the poison requires gradient descent → SLT finding singularities / grokking to actualize, as it’s non-trivially entangled with the dataset. Possibly the Algorithmic Information Theory people have some neater tricks than just training a model and then inspecting it, but I’d guess that’s the easiest way.
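To make the “train a model, then inspect it” route concrete, here is a toy sketch of my own (purely illustrative, with made-up data and a linear probe standing in for real mech interp): train a tiny MLP on data where a hidden “poison” feature partly drives the labels, then check how linearly decodable that feature is from the hidden layer, compared against an untrained baseline. Circuit-level poison detection on real models would need much heavier tooling than this.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: a hidden "poison" feature (x[:, 0] > 0) partly drives the label alongside
# the ostensible task, standing in for structure implied by a poisoned dataset.
n, d = 4096, 32
x = torch.randn(n, d)
poison = (x[:, 0] > 0).float()
task = (x[:, 1] + x[:, 2] > 0).float()
y = torch.where(torch.rand(n) < 0.3, poison, task)

def make_model():
    # Narrow bottleneck so the hidden layer mostly keeps what training found useful.
    return nn.Sequential(nn.Linear(d, 4), nn.ReLU(), nn.Linear(4, 1))

def train(module, inputs, targets, steps=500):
    opt = torch.optim.Adam(module.parameters(), lr=1e-2)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(module(inputs).squeeze(-1), targets).backward()
        opt.step()

def probe_accuracy(model):
    # "Inspect it": fit a linear probe on hidden activations for the poison feature.
    with torch.no_grad():
        hidden = model[1](model[0](x))
    probe = nn.Linear(hidden.shape[1], 1)
    train(probe, hidden, poison)
    with torch.no_grad():
        return ((probe(hidden).squeeze(-1) > 0).float() == poison).float().mean().item()

untrained = make_model()                 # baseline: same architecture, no training
trained = make_model()
train(trained, x, y)                     # step 1: train on the (possibly poisoned) dataset

# Step 2: compare how decodable the poison feature is before vs. after training.
print(f"probe accuracy, untrained model: {probe_accuracy(untrained):.2f}")
print(f"probe accuracy, trained model:   {probe_accuracy(trained):.2f}")
```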