Masters student in Physics at the University of Queensland.
I am interested in Quantum Computing, physical AI Safety guarantees and alignment techniques that will work beyond contemporary models.
I dislike when conversations that are really about one topic get muddied by discussion of an analogy. For the sake of clarity, I’ll use italics for statements about the AI safety jobs at capabilities companies.
Interesting perspective. At least one other person also had a problem with that statement, so it is probably worth me expanding.
Assume, for the sake of the argument, that the Environmental Manager’s job is to assist with clean-ups after disasters, monitor for excessive emissions and prevent environmental damage. In a vacuum these are all wonderful, somewhat EA-aligned tasks.
Similarly the safety focused role, in a vacuum, is mitigating concrete harms from prosaic systems and, in the future, may be directly mitigating existential risk.
However, when we zoom out and look at these jobs in the context of the larger organisation’s goals, things are less clear. The good you do helps fuel a machine whose overall goals are harmful.
The good that you do is profitable for the company that hires you. This isn’t always a bad thing, but by allowing BP to operate in a more environmentally friendly manner you improve BP’s public relations and help to soften or reduce the regulation BP faces.
Making contemporary AI systems safer, reducing harm in the short term, potentially reduces the regulatory hurdles that these companies face. It is harder to push restrictive legislation governing the operation of AI capabilities companies if they have good PR.
More explicitly, the short-term environmental management that you do may hide more long-term, disastrous damage. Programs to protect workers and locals from toxic chemical exposure around an exploration site help keep the overall business viable. While the techniques you develop shield the local environment from direct harm, you are not shielding the globe from the harmful impact of pollution.
Alignment and safety research at capabilities companies focuses on today’s models, which are not generally intelligent. You are forced to assume that the techniques you develop will extend to systems that are generally intelligent, deployed in the real world and capable of being an existential threat.
Meanwhile the techniques used to align contemporary systems absolutely improve their economic viability and indirectly mean more money is funnelled towards AGI research.
“everyday common terms such as tool, function/purpose, agent, perception”
I suspect getting the “true name” of these terms would get you a third of the way to resolving AI safety.
Firstly, some form of visible disclaimer may be appropriate if you want to continue listing these jobs.
While the jobs board may not be “conceptualized” as endorsing organisations, I think some users will see jobs from OpenAI listed on the job board as at least a partial, implicit endorsement of OpenAI’s mission.
Secondly, I don’t think roles being directly related to safety or security should be a sufficient condition to list roles from an organisation, even if the roles are opportunities to do good work.
I think this is easier to see if we move away from the AI Safety space. Would it be appropriate for the 80,000 Hours job board to advertise an Environmental Manager job from British Petroleum?
Just started using this, great recommendation. I like the night mode feature, which changes the colour of the PDF itself.
I think this experiment does not substantially update me towards thinking we are closer to AGI, because it does not show GPT-4o coming up with a strategy to solve the task and then executing it. Rather, a human (a general intelligence) has looked at the benchmark and devised an algorithm that lets GPT-4o perform well on the task.
Further, the method does not seem flexible enough to work on a diverse range of tasks and certainly not without human involvement in adapting it.
In other words, the result is less that GPT-4o is able to achieve 50% on ARC-AGI. It is that a human familiar with the style of question used in ARC-AGI can devise a method for getting 50% on ARC-AGI that offloads some of the workload to GPT-4o.
The sequels obviously include a lot of material relating to aliens, but a big focus is on how human groups react to the various dangerous scenarios they now face. The books are largely concerned with how human culture evolves given the circumstances, with numerous multi-generational time-skips.
Updating to say that I just finished the short story “Exhalation” by Ted Chiang and it was absolutely exceptional!
I was immediately compelled to share it with some friends who are also into sci-fi.
Cool list, I’m going to start reading Ted Chiang.
Some thoughts
Permutation City
”To be blunt, Egan is not a great author, and this book is mostly his excuse to elucidate some ideas in philosophy.”
You are being, if anything, too nice to Greg Egan’s writing here. I think 4⁄10 is extremely charitable.
But if you enjoyed the hard sci-fi elements you’ll probably also enjoy “Diaspora”. Even the errata for this book make for a fun read and show you the level of care Egan puts into trying to make the science realistic.
The Three Body Problem
The two other books in the series (particularly The Dark Forest) are very interesting and have a much wider scope, which gives Liu a lot of space for world-building. There’s also a fair bit of commentary on societal and cultural evolution, which you might enjoy if you liked the non-western perspective of the first book.
A fair warning about the readability of The Dark Forest. Liu’s editor somehow let him keep in some crushingly boring material.
Death’s End is extremely wide in scope and faster paced. But I think you might hate the more fantastical sci-fi elements.
Obvious and “shallow” suggestion. Whoever goes on needs to be “classically charismatic” to appeal to a mainstream audience.
Potentially this means someone from policy rather than technical research.
“I assumed the idea here was that AGI has a different mind architecture and thus also has different internal concepts for reflection.”
It is not just the internal architecture. An AGI will have a completely different set of actuators and sensors compared to humans.
A suggestion to blacklist anyone who decided to give $30 million (a paltry sum of money for a startup) to OpenAI.
I agree with many of the points you have made in this post, but I strongly disagree with the characterisation of $30 million as a “paltry sum”.
1. My limited research indicates that $30 million was likely a significant amount of money for OpenAI at the time
I haven’t been able to find internal financial reports from OpenAI for 2017*, but the following quote from Wikipedia describes OpenAI’s operating expenses in that year.
”In 2017, OpenAI spent $7.9 million, or a quarter of its functional expenses, on cloud computing alone”
If $7.9 million was a quarter of functional expenses, OpenAI’s total functional expenses in 2017 were roughly $32 million. So, while OpenAI is currently worth tens of billions, $30 million appears to have been a significant sum for them in 2017, comparable to a full year of operating costs.
Again, I haven’t been able to find internal financial reports (not claiming they aren’t available).
My understanding is Open Phil would have access to reports which would show that $30 million was or wasn’t a significant amount of money at the time, although they’re probably bound by confidentiality agreements which would forbid them from sharing.
2. $30 million was (and still is) a substantial amount of money for AI Safety Research.
This can be seen by simply looking at the financial reports of various safety orgs. In my original shortform post I believe I compared that amount to a few years of MIRI’s operating expenses.
But you can take your pick of safety orgs and you’ll see that $30 million buys you a lot. AI Safety researchers are (relatively) cheap.
This was a great reply. In responding to it my confidence in my arguments declined substantially.
I’m going to make what I think is a very cruxy high level clarification and then address individual points.
High Level Clarification
My original post has clearly done a poor job at explaining why I think the mismatch between the optimisation definition given in RFLO and evolution matters. I think clarifying my position will address the bulk of your concerns.
“I don’t see why this matters for anything, right? Did Eliezer or Nate or whoever make some point about evolution which is undermined by the observation that evolution is not a separate system?
[...]
Yet again, I don’t see why this matters for anything.”
I believe you have interpreted the high level motivation behind my post to be something along the lines of “evolution doesn’t fit this definition of optimisation, and therefore this should be a reason to doubt the conclusions of Nate, Eliezer or anyone else invoking evolution.”
This is a completely fair reading of my original post, but it wasn’t my intended message.
I’m concerned that AI Safety research lacks a sufficiently robust framework for reasoning about the development, deployment and behaviour of AGIs. I am very interested in the broad category of “deconfusion”. It is under that lens that I comment on evolution not fitting the definition in RFLO. It indicates that the optimiser framework in RFLO may not be cutting reality at the joints, and a more careful treatment is needed.
I’m going to immediately edit my original post to make this more clear thanks to your feedback!
Detailed Responses to Individual Points
“And also, sometimes a difference between X and Y is totally irrelevant to the point that someone was making with their analogy”
I agree. I mentioned Nate’s evolution analogy because I think it wasn’t needed to make the point and led to confusion. I don’t think the properties of evolution I’ve mentioned can be used to argue against the Sharp Left Turn.
“If you just want to list any property that Evolution does not share with ML optimisation, here are a bunch: (1) The first instantiation of Evolution on Earth was billions of years earlier than the first instantiation of ML optimisation. (2) Evolution was centrally involved in the fact that I have toenails, whereas ML optimisation was not. (3) ML optimisation can be GPU-accelerated using PyTorch, whereas Evolution cannot … I could go on all day! Of course, none of these differences are relevant to anything. That’s my point. I don’t think the three differences you list are relevant to anything either.”
Keeping in mind the “deconfusion” lens that motivated my original post, I don’t think these distinctions point to any flaws in the definition of optimisation given in RFLO, in the same way that evolution failing to satisfy the criteria of having an internally represented objective does.
”I don’t even know why the RFLO paper put that criterion in. Like, let’s compare (1) an algorithm training an LLM with the “explicit” objective function of minimizing perplexity, written in Python, (2) some super-accelerated clever hardware implementation that manages to perform the exact same weight updates but in a way that doesn’t involve the objective function ever being “explicitly represented” or calculated.”
I don’t have any great insight here, but that’s very interesting to think about. I would guess that a “clever hardware implementation that performs the exact same weight updates” without an explicitly represented objective function ends up being wildly inefficient. This seems broadly similar to the relationship between a search algorithm and an implementation of the same algorithm that is simply a gigantic pre-computed lookup table.
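To make that comparison concrete, here is a toy sketch (entirely my own construction, not anything from the RFLO paper): a policy that searches over actions using an explicitly represented objective, next to a pre-computed lookup table with identical input-output behaviour in which the objective never appears at run time.

```python
def objective(state: int, action: int) -> int:
    """Explicitly represented objective: how close the action gets us to the target state 7."""
    return -abs((state + action) - 7)

def search_policy(state: int) -> int:
    """Searches over candidate actions, scoring each with the explicit objective."""
    return max(range(-3, 4), key=lambda a: objective(state, a))

# Pre-compute the answers once; at "run time" the table never consults the objective.
TABLE_POLICY = {s: search_policy(s) for s in range(15)}

# Identical behaviour, but only one of the two systems explicitly represents its objective.
assert all(TABLE_POLICY[s] == search_policy(s) for s in range(15))
```

For any interesting state space the table is, of course, absurdly expensive to build and store, which is the inefficiency intuition above.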
“Separately, I wonder whether you would say that AlphaZero self-play training is an “optimizer” or not. Some of your points seem to apply to it—in particular, since it’s self-play, the opponent is different each step, and thus so is the “optimal” behavior. You could say “the objective is always checkmate”, but the behavior leading to that objective may keep changing; by the same token you could say “the objective is always inclusive genetic fitness”, but the behavior leading to that objective may keep changing. (Granted, AlphaZero self-play training in fact converges to very impressive play rather than looping around in circles, but I think that’s an interesting observation rather than an a priori theoretical requirement of the setup—given that the model obviously doesn’t have the information capacity to learn actual perfect play.)”
Honestly, I think this example has caused me to lose substantial confidence in my original argument.
Clearly, the AlphaZero training process should fit under any reasonable definition of optimisation, and as you point out there is no fundamental reason a similar training process on a variant game couldn’t get stuck in a loop.
The only distinction I can think of is that the definition of “checkmate” is essentially a function of board state and that function is internally represented in the system as a set of conditions. This means you can point to an internal representation and alter it by explicitly changing certain bits.
In contrast, evolution is stuck optimising for genes which are good at (directly or indirectly) getting passed on.
I guess the equivalent of changing the checkmate rules would be changing the environment to tweak which organisms tend to evolve. But the environment doesn’t provide an explicit representation.
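To illustrate what I mean by being able to point to an internal representation and alter it by changing certain bits, here is a toy sketch (my own, not AlphaZero’s actual implementation): the win condition lives inside the system as data, so the objective can be edited directly.

```python
# The objective is stored explicitly as data inside the system.
WIN_CONDITION = {"piece": "king", "state": "captured"}

def is_win(board: dict) -> bool:
    """Checks a board against the explicitly represented win condition."""
    return board.get(WIN_CONDITION["piece"]) == WIN_CONDITION["state"]

# Because the objective is just data, we can alter it by flipping those bits,
# and the surrounding training process would now optimise toward the new target.
WIN_CONDITION["piece"] = "queen"
```

There is no analogous data structure sitting inside “evolution” that we could edit to redefine fitness.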
To conclude
I’m fairly confident the “explicit internal representation” part of the optimisation definition in RFLO needs tweaking.
I had previously been tossing around the idea that evolution was sort of its own thing, meaningfully distinct from other things called optimisers, but the AlphaZero example has scuttled that idea.
This is somewhat along the lines of the point I was trying to make with the Lazy River analogy.
I think the crux is that I’m arguing that because the “target” evolution appears to be optimising towards depends on the current state and differs as the state changes, it doesn’t seem right to refer to it as “internally represented”.
There are meaningful distinctions between evolution and other processes referred to as “optimisers”
People should be substantially more careful about invoking evolution as an analogy for the development of AGI, as tempting as this comparison is to make.
“Risks From Learned Optimisation” is one of the most influential AI Safety papers ever written, so I’m going to use its framework for defining optimisation.
“We will say that a system is an optimiser if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system” ~Hubinger et al (2019)
It’s worth noting that the authors of this paper do consider evolution to be an example of optimisation (something stated explicitly in the paper). Despite this, I’m going to argue the definition shouldn’t apply to evolution.
Two Strong (and One Weak) Arguments That Evolution Doesn’t Fit This Definition:
Weak Argument 0:
Evolution itself isn’t a separate system that is optimising for something. (Micro)evolution is the change in allele frequency over generations. There is no separate entity you can point to and call “evolution”.
Consider how different this is from a human engaged in optimisation to design a bottle cap. We have the system that optimises, and the system that is optimised.
It is tempting to say “the system optimises itself”, but then try to define the system you would say is engaged in optimisation. That system isn’t “evolution” but is instead something like “the environment”, “all carbon-based structures on Earth” or “all matter on the surface of the Earth”, etc.
Strong Argument 1:
Evolution does not have an explicitly represented objective function.
This is a major issue. When I’m training a model against a loss function, I can explicitly represent that loss function: it is possible to point to a physical implementation of it within the system doing the optimising.
There is no single explicit representation of what “fitness” is within our environment.
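For contrast, here is a minimal sketch (purely illustrative, written in plain NumPy rather than any particular framework) of what an explicitly represented objective looks like in ML training: the loss function exists as a concrete object inside the optimising system, and every update is computed directly from it.

```python
import numpy as np

def mse_loss(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """The explicitly represented objective that the optimiser consults at every step."""
    return float(np.mean((X @ w - y) ** 2))

def grad_step(w: np.ndarray, X: np.ndarray, y: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """One gradient descent step, computed directly from the explicit loss."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
for _ in range(500):
    w = grad_step(w, X, y)

print(mse_loss(w, X, y))  # near zero: the system descends the objective it explicitly represents
```

You can point at `mse_loss` and change it; there is no analogous object you can point at and change for “fitness”.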
Strong Argument 2:
Evolution isn’t a “conservative” process. The thing that it is optimising “toward” is dependent on the current state of the environment, and changes over time. It is possible for evolution to get caught in “loops” or “cycles”.
- A refresher on conservative fields.
In physics, a conservative vector field is a vector field that can be understood as the gradient of some other (scalar) function. By associating each point in the vector field with the value of that scalar function at the same point, you can meaningfully order the points in your field.
To be less abstract, imagine your field is a “slope” field describing the gradient of a mountain range. You can meaningfully order the points in the slope field by the height of the corresponding point on the mountain range.
In a conservative vector field, the curl is zero everywhere. Let a ball roll down the mountain range (with a very high amount of friction) and it will find its way to a local minimum and stop.
In a non-conservative vector field it is possible to create paths that loop forever.
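For reference, the standard statements of these facts (for a field on a simply connected domain) are:

```latex
\mathbf{F} = \nabla V
\quad\Longleftrightarrow\quad
\nabla \times \mathbf{F} = \mathbf{0}
\quad\Longleftrightarrow\quad
\oint_C \mathbf{F} \cdot \mathrm{d}\boldsymbol{\ell} = 0 \;\; \text{for every closed loop } C.
```

The potential V is what provides the absolute ordering of points; a non-zero loop integral means no such V exists, so there is nothing the field is “descending toward”.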
My local theme-park has a ride called the “Lazy River” which is an artificial river which has been formed into a loop. There is no change in elevation, and the water is kept flowing clockwise by a series of underwater fans which continuously put energy into the system. Families hire floating platforms and can drift endlessly in a circle until their children get bored.
If you throw a ball into the Lazy River it will circle endlessly. If we write down a vector field that describes the force on the ball at any point in the river, it isn’t possible to describe this field as the gradient of another field. There is no absolute ordering of points in this field.
- Evolution isn’t conservative
For the ball rolling over the hills, we can say that as time passes it seems to be getting “lower”. By appealing to the function that the field is the gradient of, we can meaningfully say whether two points are higher, lower or at the same height.
In the lazy river, this is no longer possible. Locally, you could describe the motion of the ball as rolling down a hill, but continuing this process around the entire loop tells you that you are describing an impossible MC Escher Waterfall.
If evolution is not conservative (and hence has no underlying goal it is optimising toward), then it would be possible to observe creatures evolving in circles, stuck in “loops”: evolving, losing, then re-evolving the same structures.
This is not only possible, it has been observed. The side-blotched lizard appears to shift throat colour in a cyclic, repeating pattern. For more details, see this talk by John Baez.
To summarise, the “direction” or “thing evolution is optimising toward” cannot be some internally represented thing, because the thing it optimises toward is a function of not just the environment but also of the things evolving in that environment.
Who cares?
Using evolution as an example of “optimisation” is incredibly common among AI safety researchers, and can be found in Yudkowsky’s writing on evolution in The Sequences.
I think the notion of evolution as an optimiser can do more harm than good.
As a concrete example, Nate’s “Sharp Left Turn” post was weakened substantially by invoking an evolution analogy, which spawned a lengthy debate (see Pope, 2023 and the response from Zvi). This issue could have been skipped entirely simply by arguing in favour of the Sharp Left Turn without any reference to evolution (see my upcoming post on this topic).
Clarification Edit:
Further, I’m concerned that AI Safety research lacks a sufficiently robust framework for reasoning about the development, deployment and behaviour of AGIs. I am very interested in the broad category of “deconfusion”. It is under that lens that I comment on evolution not fitting the definition in RFLO. It indicates that the optimiser framework in RFLO may not be cutting reality at the joints, and a more careful treatment is needed.
To conclude
An intentionally provocative and attention-grabbing summary of this post might be “evolution is not an optimiser”, but that is essentially just a semantic argument and isn’t quite what I’m trying to say.
A better summary is “in the category of things that are referred to as optimisation, evolution has numerous properties that it does not share with ML optimisation, so be careful about invoking it as an analogy”.
On similarities between RLHF-like ML and evolution:
You might notice that any form of ML that relies on human feedback also fails to have an “internal representation” of what it’s optimising toward, instead getting feedback from humans assessing its performance.
Like evolution, it is also possible to set up this optimisation process so that it is also not “conservative”.
A contrived example of this:
Consider training a language model to complete text, where the humans giving feedback exhibit a preference for text that is a function of what they’d just read. If the model outputs dense, scientific jargon, the humans prefer lighter prose. If the model outputs light prose, the humans prefer more formal writing, etc.
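Here is that loop written out as a toy (the third preference, wanting jargon again after formal prose, is my own addition to close the cycle; the example above only specifies the first two):

```python
STYLES = ["jargon", "light", "formal"]

# State-dependent preferences: what the raters want depends on what they just read.
PREFERRED_AFTER = {"jargon": "light", "light": "formal", "formal": "jargon"}

def human_feedback(previous_style: str, candidate_style: str) -> float:
    """Reward comes from the raters' current state, not from any fixed internal objective."""
    return 1.0 if candidate_style == PREFERRED_AFTER[previous_style] else 0.0

# A greedy policy that always outputs whatever currently scores highest never settles down.
style = "jargon"
history = []
for _ in range(6):
    style = max(STYLES, key=lambda s: human_feedback(style, s))
    history.append(style)

print(history)  # ['light', 'formal', 'jargon', 'light', 'formal', 'jargon']: a cycle, not a fixed point
```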
(This is a draft of a post, very keen for feedback and disagreement)
That makes sense.
I guess the followup question is “how were Anthropic able to cultivate the impression that they were safety focused if they had only made an extremely loose offhand commitment?”
Certainly the impression I had from how integrated they are in the EA community was that they had made a more serious commitment.
This post confuses me.
Am I correct that the implication here is that assurances from a non-rationalist are essentially worthless?
I think it is also wrong to imply that Anthropic have violated their commitment simply because they didn’t rationally think through the implications of their commitment when they made it.
I think you can understand Anthropic’s actions as purely rational, just not very ethical.
They made an unenforceable commitment to not push capabilities when it directly benefited them. Now that it is more beneficial to drop the facade, they are doing so.
I think “don’t trust assurances from non-rationalists” is not a good takeaway. Rather it should be “don’t trust unenforceable assurances from people who will stand to greatly benefit from violating your trust at a later date”.
I agree that it is certainly morally wrong to post this if that is the persons real full name.
It is less bad, but still dubious, to post someone’s traumatic life story on the internet even under a pseudonym.
At the risk of missing something obvious: in any distributed quantum circuit without a measurement step, it is not possible for Kevin and Charlie to learn anything about the plaintext, per the no-cloning theorem.
Eavesdropping in the middle of the circuit should lead to measurable statistical anomalies due to projecting the state onto the measurement basis.
(I’ll add a caveat that I am talking about theoretical quantum circuits and ignoring any nuances that emerge from their physical implementations.)
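As a minimal illustration of the statistical-anomaly point (a single-qubit toy in plain NumPy, not the actual distributed Kevin/Charlie setting): an eavesdropper who measures mid-circuit projects the state and visibly shifts the final output statistics.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)  # Hadamard gate
ket0 = np.array([1.0, 0.0])

def run_circuit(eavesdrop: bool, rng: np.random.Generator) -> int:
    state = H @ ket0                      # prepare |+> = (|0> + |1>)/sqrt(2)
    if eavesdrop:
        p0 = abs(state[0]) ** 2           # eavesdropper measures in the Z basis...
        outcome = 0 if rng.random() < p0 else 1
        state = np.zeros(2)
        state[outcome] = 1.0              # ...projecting the state onto |0> or |1>
    state = H @ state                     # honest parties finish the circuit
    return 0 if rng.random() < abs(state[0]) ** 2 else 1

rng = np.random.default_rng(0)
shots = 10_000
clean = sum(run_circuit(False, rng) for _ in range(shots)) / shots   # ~0.0: always measures 0
tapped = sum(run_circuit(True, rng) for _ in range(shots)) / shots   # ~0.5: statistics visibly shift
print(clean, tapped)
```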
Edit:
On posting, I think I realize my error.
We need Kevin and Charlie to not have knowledge of the specific gates that they are implementing as well.
Do you know if there have been any concrete outcomes (i.e. someone giving Daniel a substantial amount of money) from the discussion?
I agree with your overall point re: 80k hours, but I think my model of how this works differs somewhat from yours.
“But you can’t leverage that into getting the machine to do something different- that would immediately zero out your status/cooperation score.”
The machines are groups of humans, so the degree to which you can change the overall behaviour depends on a few things.
1) The type of status (which as you hint, is not always fungible).
If you’re widely considered to be someone who is great at predicting future trends and risks, other humans in the organisation will be more willing to follow when you suggest a new course of action. If you’ve acquired status by being very good at one particular niche task, people won’t necessarily value your bold suggestion for changing the organisation’s direction.
2) Strategic congruence.
Some companies in history have successfully pivoted their business model (the example that comes to mind is Nokia). This transition is possible because while the machine is operating in a new way, the end goal of the machine remains the same (make money). If your suggested course of action conflicts with the overall goals of the machine, you will have more trouble changing the machine.
3) Structure of the machine.
Some decision making structures give specific individuals a high degree of autonomy over the direction of the machine. In those instances, having a lot of status among a small group may be enough for you to exercise a high degree of control (or get yourself placed in a decision making role).
Of course, these variables all interact with each other in complex ways.
Sam Altman’s high personal status as an excellent leader and decision-maker, combined with his strategic alignment to making lots of money, meant that he was able to out-manoeuvre a more safety-focused board when he came into apparent conflict with the machine.